Introduction to Algorithms


Throughout much of the documentation, we avoid detailed discussion of the inner workings of 
procedures in order to promote readability. This algorithms document is designed as a resource 
for those interested in the specific calculations performed by procedures. 


Algorithms Used in Multiple Procedures 


For some statistics, such as the significance of a t test, the same algorithms are used in more than 
one procedure. Another example is the group of post hoc tests that are used in ONEWAY and GLM. 
You can find algorithms for these tests in the appendixes. 


Choice of Formulas 


Starting with the first statistics course, students learn that there are often several equivalent ways 
to compute the same quantity. For example, the variance can be obtained using either of the 
following formulas: 


$$s^2 = \sum_{i=1}^{N} (x_i - \bar{x})^2 / (N - 1)$$

$$s^2 = \left[ \sum_{i=1}^{N} x_i^2 - \left( \sum_{i=1}^{N} x_i \right)^2 / N \right] / (N - 1)$$


Since the formulas are algebraically equal, the one preferred is often the one easier to use (or 
remember). For small data sets consisting of “nice” numbers, the arbitrary choice of several 
computational methods is usually of little consequence. However, for handling large data sets 
or “troublesome” numbers, the choice of computational algorithms can become very important, 
even when those algorithms are to be executed by a computer. Care must be taken to choose an 
algorithm that produces accurate results under a wide variety of conditions without requiring 
extensive computer time. Often, these two considerations must be balanced against each other. 
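The effect of the choice of formula can be seen in a small sketch. The code below is an illustration only, not the routine any procedure uses; the function names and the data are made up for the example. It compares the deviation-based formula with the raw-sums formula on values that have a small spread around a very large offset.

```python
def variance_two_pass(xs):
    """Sample variance from squared deviations about the mean."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def variance_one_pass(xs):
    """Sample variance from the raw sums of x and x**2."""
    n = len(xs)
    total = sum(xs)
    total_sq = sum(x * x for x in xs)
    return (total_sq - total * total / n) / (n - 1)


# Hypothetical "troublesome" data: true sample variance is 30.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
two_pass = variance_two_pass(data)
one_pass = variance_one_pass(data)
print(two_pass, one_pass)
```

In double precision the raw-sums version subtracts two numbers near 4e18, so the digits that carry the variance are lost, while the deviation-based version recovers 30 exactly for this data. On small "nice" data the two agree.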


You may notice that the same statistic is computed differently in various routines. Among the 
reasons for this are the precision of the calculations and the desirability of computing many 
statistics in the same procedure. For example, in the paired t test procedure (T-TEST), the need 
to compute both the correlation coefficient and the standard error of the difference led to the 
selection of a different algorithm than would have been chosen for computation of only the 
standard error. Throughout the years of development, the personal preferences of many designers 
and programmers have also influenced the choice of algorithms. Now, as new routines are added 
and old ones are updated, any unnecessary diversity is being replaced by a consistent core 
of algorithms. 


Missing Values 


Since similar options for treatment of missing values are available in many procedures, treatment 
of missing values has often not been specified in each chapter. Instead, the following rules 
should be used: 


■ If listwise deletion has been specified and a missing value is encountered, the case is not 
included in the analysis. This is equivalent, for purposes of following the algorithms, to 
setting the case weight to zero. 


■ If variable-wise deletion is in effect, a case has zero weight when the variable with missing 
values is involved in computations. 


■ If pairwise deletion has been specified, a case has zero weight when one or both of a pair of 
variables included in a computation is missing. 


■ If missing-values declarations are to be ignored, all cases are always included in the 
computation. 
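The rules above can be read as a function that returns the weight a case contributes to one computation. The following is a minimal sketch under that reading; the function name, the mode labels, and the use of None for a missing value are assumptions made for the example.

```python
def effective_weight(case, weight, analysis_vars, used_vars, mode):
    """Weight a case contributes to one computation.

    case          : dict mapping variable name -> value (None = missing)
    analysis_vars : every variable named in the analysis
    used_vars     : the variables entering this particular computation
    mode          : 'listwise', 'pairwise' (also covers variable-wise),
                    or 'include' (hypothetical labels)
    """
    if mode == "include":
        return weight
    # Listwise looks at all analysis variables; pairwise and variable-wise
    # look only at the variables used in the current computation.
    check = analysis_vars if mode == "listwise" else used_vars
    return 0.0 if any(case[v] is None for v in check) else weight


cases = [
    {"a": 1.0, "b": 2.0, "c": None},
    {"a": 3.0, "b": None, "c": 5.0},
    {"a": 5.0, "b": 6.0, "c": 7.0},
]
all_vars = ["a", "b", "c"]

# Weights for a computation that involves only variables a and b.
listwise = [effective_weight(c, 1.0, all_vars, ["a", "b"], "listwise") for c in cases]
pairwise = [effective_weight(c, 1.0, all_vars, ["a", "b"], "pairwise") for c in cases]
print(listwise, pairwise)
```

The first case is dropped under listwise deletion because c is missing, but it still contributes to a computation on a and b under pairwise deletion.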


It should be noted again that the computer routines do not alter case weights for cases with missing 
data but, instead, actually exclude them from the computations. Missing-value specifications do 
not apply when a variable is used for weighting. All values are treated as weights. 


2SLS produces the two-stage least-squares estimation for a structure of simultaneous linear 
equations. 


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


$n$          Number of cases 
$p$          Number of predictors 
$p_1$        Number of endogenous variables among p predictors 
$p_2$        Number of non-endogenous variables among p predictors 
$k$          Number of instrument variables 
$y$          n×1 vector which consists of a sample of the dependent variable 
$Z$          n×p matrix which represents observed predictors 
$\beta$      p×1 parameter vector 
$X$          n×k matrix with element $x_{ij}$, which represents the observed value of the 
             jth instrumental variable for case i 
$Z_1$        Submatrix of Z with dimension n×$p_1$, which represents observed endogenous 
             variables 
$Z_2$        Submatrix of Z with dimension n×$p_2$, which represents observed 
             non-endogenous variables 
$\beta_1$    Subvector of $\beta$ with parameters associated with $Z_1$ 
$\beta_2$    Subvector of $\beta$ with parameters associated with $Z_2$ 


Model 


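The structure equations of interest are written in the form: 

$$y = Z\beta + \varepsilon = [Z_1, Z_2] \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} + \varepsilon$$

$$Z_1 = X\gamma + \delta$$

where 

$$Z = [Z_1, Z_2], \quad \beta = \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix}$$

and $\varepsilon$ and $\delta$ are the disturbances with zero means and covariance matrices $\sigma^2 I_n$ and $\varsigma^2 I_n$, 
respectively. 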

Estimation 


The estimation technique used was developed by Theil (Theil, 1953). First 
premultiply both sides of the model equation by $X'$ to obtain 

$$X'y = X'Z\beta + X'\varepsilon$$

Since the transformed disturbance vector $X'\varepsilon$ has zero mean and covariance matrix $\sigma^2 (X'X)$, 
$(X'X)^{-1/2} X'\varepsilon$ would have a covariance matrix $\sigma^2 I_k$. Thus, multiplying both sides 
of the above equation by $(X'X)^{-1/2}$ results in a multiple linear regression model 

$$(X'X)^{-1/2} X'y = (X'X)^{-1/2} X'Z\beta + (X'X)^{-1/2} X'\varepsilon$$

The ordinary least-squares estimator $\hat{\beta}$ for $\beta$ is 

$$\hat{\beta} = \left( Z'X (X'X)^{-1} X'Z \right)^{-1} Z'X (X'X)^{-1} X'y$$
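As an illustration of the closed-form estimator, the sketch below evaluates $(Z'X(X'X)^{-1}X'Z)^{-1} Z'X(X'X)^{-1}X'y$ directly with small pure-Python matrix helpers. It is not how the procedure computes the estimate (the procedure works from a swept correlation matrix, as described under Computational Details); the helper names and the data are assumptions made for the example.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def two_stage_least_squares(y, Z, X):
    """Return beta_hat for predictors Z (n x p) and instruments X (n x k)."""
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    XtZ = matmul(Xt, Z)
    Xty = matmul(Xt, [[v] for v in y])
    ZtX = transpose(XtZ)
    A = matmul(ZtX, solve(XtX, XtZ))  # Z'X (X'X)^-1 X'Z
    b = matmul(ZtX, solve(XtX, Xty))  # Z'X (X'X)^-1 X'y
    return [row[0] for row in solve(A, b)]


# Hypothetical data: y = 2 + 3 z exactly, with one instrument x plus a constant.
z = [1.0, 2.0, 4.0, 7.0, 8.0]
x = [0.0, 1.0, 1.0, 3.0, 5.0]
y = [2.0 + 3.0 * v for v in z]
Z = [[1.0, v] for v in z]
X = [[1.0, v] for v in x]
beta = two_stage_least_squares(y, Z, X)
print(beta)
```

Because y is an exact linear function of Z in this example, the estimator returns the generating coefficients up to rounding, whichever valid instruments are used.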


Computational Details 


► 2SLS constructs a matrix R, 

$$R = \begin{bmatrix} 1 & v' \\ v & B \end{bmatrix}$$

where 

$$B = C_{zx} (C_{xx})^{-1} C_{xz}$$

$$v = C_{zx} (C_{xx})^{-1} C_{xy}$$

and $C_{zx}$ is the correlation matrix between Z and X, and $C_{xx}$ is the correlation matrix among 
instrumental variables. 

► Sweep the matrix R to obtain the regression coefficient estimate for $\beta$. 

► Compute the sum of squares of residuals (SSE) by 

$$SSE = y'y - u'Z'y - y'Zu + u'Z'Zu$$

where 

$$u' = y'X (X'X)^{-1} X'Z \left[ Z'X (X'X)^{-1} X'Z \right]^{-1}$$

► Compute the statistics for the ANOVA table and for variables in the equation. For more 
information, see the topic “REGRESSION Algorithms”. 


Procedures ACF and PACF print and plot the sample autocorrelation and partial autocorrelation 
functions of a series of data. 


Notation 

The following notation is used throughout this chapter unless otherwise stated: 

Table 3-1 
Notation 

Notation            Description 
$x_i$               ith observation of input series, i=1,...,n 
$r_k$               kth lag sample autocorrelation 
$\hat{\phi}_{kk}$   kth lag sample partial autocorrelation 


Basic Statistics 


The following formulas are used if no missing values are encountered. If missing values are 
present, see “Series with Missing Values” for modification of some formulas. 


Sample Autocorrelation 


$$r_k = \frac{\sum_{i=1}^{n-k} (x_i - \bar{x})(x_{i+k} - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

where $\bar{x}$ is the average of the n observations. 
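A direct transcription of the lag-k sample autocorrelation for a series without missing values might look like the following sketch. The function name and the example series are made up for illustration.

```python
def sample_autocorrelation(x, k):
    """Lag-k sample autocorrelation of a complete series x."""
    n = len(x)
    mean = sum(x) / n
    # Numerator runs over the n - k pairs (x_i, x_{i+k}).
    num = sum((x[i] - mean) * (x[i + k] - mean) for i in range(n - k))
    # Denominator always uses all n observations.
    den = sum((v - mean) ** 2 for v in x)
    return num / den


series = [1.0, 2.0, 3.0, 4.0, 5.0]
r1 = sample_autocorrelation(series, 1)
r2 = sample_autocorrelation(series, 2)
print(r1, r2)
```

Note that the denominator does not shrink with the lag, so the estimate is damped toward zero at long lags.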


Standard Error of Sample Autocorrelation 


There are two formulas for the standard error of $r_k$ based on different assumptions about 
the autocorrelation. Under the assumption that the true MA order of the process is k−1, the 
approximate variance of $r_k$ (Bartlett, 1946) is: 

$$\mathrm{var}(r_k) \cong \frac{1}{n} \left( 1 + 2 \sum_{l=1}^{k-1} r_l^2 \right)$$

The standard error is the square root (Box and Jenkins, 1976, p. 35). Under the assumption that 
the process is white noise, 

$$\mathrm{var}(r_k) \cong \frac{1}{n} \left( \frac{n-k}{n+2} \right)$$


Box-Ljung Statistic 


At lag k, the Box-Ljung statistic is defined by 

$$Q_k = n(n+2) \sum_{l=1}^{k} \frac{r_l^2}{n-l}$$

When n is large, $Q_k$ has a chi-square distribution with degrees of freedom k−p−q, where p and q 
are autoregressive and moving average orders, respectively. The significance level of $Q_k$ is 
calculated from the chi-square distribution with k−p−q degrees of freedom. 
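The two standard errors and the Box-Ljung statistic described above can be sketched as follows for a complete series. The formulas coded here are the usual Bartlett, white-noise, and Box-Ljung forms, taken as an assumption where the printed equations are hard to read; the function names and the numbers are hypothetical.

```python
import math


def se_bartlett(r, k, n):
    """Standard error of r_k assuming an MA(k-1) process; r[l-1] is r_l."""
    return math.sqrt((1.0 + 2.0 * sum(v * v for v in r[:k - 1])) / n)


def se_white_noise(k, n):
    """Standard error of r_k assuming the series is white noise."""
    return math.sqrt((n - k) / (n * (n + 2.0)))


def box_ljung(r, k, n):
    """Box-Ljung Q statistic at lag k from autocorrelations r_1..r_k."""
    return n * (n + 2.0) * sum(r[l - 1] ** 2 / (n - l) for l in range(1, k + 1))


# Hypothetical autocorrelations at lags 1 and 2 from a series of length 5.
r = [0.4, -0.1]
n = 5
q2 = box_ljung(r, 2, n)
print(se_bartlett(r, 2, n), se_white_noise(1, n), q2)
```

The significance of Q would then come from a chi-square distribution with k−p−q degrees of freedom, which is left out here because the Python standard library has no chi-square distribution function.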


Sample Partial Autocorrelation 


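$$\hat{\phi}_{11} = r_1$$

$$\hat{\phi}_{kj} = \hat{\phi}_{k-1,j} - \hat{\phi}_{kk} \hat{\phi}_{k-1,k-j}, \quad k = 2, \ldots; \; j = 1, 2, \ldots, k-1$$

$$\hat{\phi}_{kk} = \left[ r_k - \sum_{j=1}^{k-1} \hat{\phi}_{k-1,j} \, r_{k-j} \right] \Big/ \left( 1 - \sum_{j=1}^{k-1} \hat{\phi}_{k-1,j} \, r_j \right), \quad k = 2, \ldots$$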
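The recursion above is the Durbin-Levinson scheme. A minimal sketch, assuming that standard form and using made-up names, takes the sample autocorrelations and returns the partial autocorrelations lag by lag.

```python
def partial_autocorrelations(r):
    """Return [phi_11, phi_22, ...] from r, where r[k-1] is r_k."""
    pacf = []
    prev = []  # holds phi_{k-1,1}, ..., phi_{k-1,k-1}
    for k in range(1, len(r) + 1):
        if k == 1:
            phi_kk = r[0]
        else:
            num = r[k - 1] - sum(prev[j - 1] * r[k - j - 1] for j in range(1, k))
            den = 1.0 - sum(prev[j - 1] * r[j - 1] for j in range(1, k))
            phi_kk = num / den
        # Update the lower-order coefficients phi_{k,j} for j < k.
        cur = [prev[j - 1] - phi_kk * prev[k - j - 1] for j in range(1, k)]
        cur.append(phi_kk)
        pacf.append(phi_kk)
        prev = cur
    return pacf


# Autocorrelations of a hypothetical AR(1) process with coefficient 0.5.
pacf = partial_autocorrelations([0.5, 0.25, 0.125])
print(pacf)
```

For an AR(1) autocorrelation pattern, only the first partial autocorrelation is nonzero, which is a convenient sanity check on the recursion.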


Standard Error of Sample Partial Autocorrelation 
Under the assumption that the AR(p) model is correct and $p \le k - 1$, 

$$\hat{\phi}_{kk} \sim N(0, 1/n)$$ (Quenouille, 1949) 

Thus 

$$\mathrm{var}(\hat{\phi}_{kk}) \cong \frac{1}{n}$$


Series with Missing Values 


If there are missing values in x, the following statistics are computed differently (Cryer, 1986). 
First, define 

$\bar{x}$ = average of nonmissing $x_1, \ldots, x_n$, 

$$a_i = \begin{cases} x_i - \bar{x}, & \text{if } x_i \text{ is not missing} \\ \text{SYSMIS}, & \text{if } x_i \text{ is missing} \end{cases}$$

for k=0,1,2,..., and j=1,...,n 

$$b_j^{(k)} = \begin{cases} a_j a_{j+k}, & \text{if both are not missing} \\ \text{SYSMIS}, & \text{otherwise} \end{cases}$$

$m_k$ = the number of nonmissing values in $b_1^{(k)}, \ldots, b_{n-k}^{(k)}$ 

$m_0$ = the number of nonmissing values in x 

Sample Autocorrelation 


$$r_k = \frac{\text{sum of nonmissing } b_1^{(k)}, \ldots, b_{n-k}^{(k)}}{\text{sum of nonmissing } b_1^{(0)}, \ldots, b_n^{(0)}}$$


Standard Error of Sample Autocorrelation 

$$se(r_k) = \sqrt{\frac{1}{m_0} \left( 1 + 2 \sum_{l=1}^{k-1} r_l^2 \right)}$$ (MA assumption) 

$$se(r_k) = \sqrt{\frac{m_k}{(m_0 + 2)\, m_0}}$$ (white noise) 


Box-Ljung Statistic 

$$Q_k = m_0 (m_0 + 2) \sum_{l=1}^{k} \frac{r_l^2}{m_l}$$


Standard Error of Sample Partial Autocorrelation 

$$se(\hat{\phi}_{kk}) = \sqrt{\frac{1}{m_0}}$$


AIM Algorithms 


The Attribute Importance (AIM) procedure performs tests to find out if the groups are 
homogeneous. 


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


Table 4-1 
Notation 

Notation        Description 
$G$             Number of groups. 
$C$             Number of categories in the categorical variable. 
$n_{ij}$        Number of cases in the jth category in the ith group, i = 1, ..., G and j = 1, 
                ..., C. Assume that $n_{ij} \ge 0$. 
$n_i$           Number of cases in the ith group. $n_i = \sum_{j=1}^{C} n_{ij}$ 
$n$             Overall number of cases. $n = \sum_{i=1}^{G} n_i$. Assume n > 0. 
$p_j$           Overall proportion of cases in the jth category. $p_j = \frac{1}{n} \sum_{i=1}^{G} n_{ij}$ 
$\bar{x}_i$     Mean of the continuous variable in the ith group. 
$s_i$           Standard deviation of the continuous variable in the ith group. Assume 
                that $s_i \ge 0$. 
$\bar{x}$       Overall mean of the continuous variable. $\bar{x} = \frac{1}{n} \sum_{i=1}^{G} n_i \bar{x}_i$ 


Test of Homogeneity of Proportions 


This test is performed only for categorical variables. The null hypothesis is that the proportion 
of cases in the categories in the ith group is the same as the overall proportion. If C > 1, the 
Chi-square statistic for the ith group is computed as follows: 


$$\chi_i^2 = \sum_{j=1}^{C} \frac{(n_{ij} - n_i p_j)^2}{n_i p_j}$$

The degrees of freedom is C−1. The significance is the probability that a Chi-square random 
variate with this degrees of freedom will have a value greater than the $\chi^2$ statistic. 


If $C \le 1$, the Chi-square statistic is always 0 with zero degrees of freedom, and the significance 
value is undefined. 
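The per-group statistic can be sketched as below, assuming the usual form with observed count $n_{ij}$ and expected count $n_i p_j$. The function name and the table of counts are made up, and the sketch assumes every expected count is positive.

```python
def homogeneity_chi_square(counts):
    """counts[i][j] is n_ij; returns a (chi_square, df) pair for each group."""
    G = len(counts)
    C = len(counts[0])
    n_i = [sum(row) for row in counts]
    n = sum(n_i)
    # Overall proportion of cases in each category.
    p = [sum(counts[i][j] for i in range(G)) / n for j in range(C)]
    results = []
    for i in range(G):
        if C <= 1:
            results.append((0.0, 0))  # statistic is 0 with zero df
            continue
        chi2 = sum(
            (counts[i][j] - n_i[i] * p[j]) ** 2 / (n_i[i] * p[j])
            for j in range(C)
        )
        results.append((chi2, C - 1))
    return results


# Hypothetical 2-group, 2-category table.
stats = homogeneity_chi_square([[10, 10], [30, 10]])
print(stats)
```

Each group is compared with the pooled proportions, so a small group that differs strongly can have a larger statistic than a large group that dominates the pooled proportions.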
Test of Equality of Means 


This test is performed only for continuous variables. The null hypothesis is that the mean (of a 
continuous variable) in the ith group is the same as the overall mean. If $n_i > 1$ and $s_i > 0$, the 
Student’s t statistic for the ith group is computed as follows: 

$$t_i = \frac{\bar{x}_i - \bar{x}}{s_i / \sqrt{n_i}}$$

The degrees of freedom is $n_i - 1$. The significance is the probability that a Student’s t random 
variate with this degrees of freedom will have a value greater than the t statistic. 


When $n_i > 1$ but $s_i = 0$, this implies that the continuous variable is constant in the ith group. In 
this case, the Student’s t statistic is infinity with positive degrees of freedom $n_i - 1$, and the 
significance value is zero. 


If $n_i \le 1$, then $s_i$ is undefined. In this case, the Student’s t statistic is undefined, the degrees of 
freedom is 0, and the significance value is undefined. 
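The three cases described above can be collected in one small function. This is a sketch that assumes the statistic has the usual one-sample form, (group mean − overall mean) / (s / √n); the function name and the use of None for "undefined" are choices made for the example.

```python
import math


def group_t_statistic(mean_i, sd_i, n_i, overall_mean):
    """Return (t, df) for one group; None marks an undefined statistic."""
    if n_i <= 1:
        return None, 0  # the standard deviation is undefined
    df = n_i - 1
    if sd_i == 0:
        return math.inf, df  # variable is constant within the group
    t = (mean_i - overall_mean) / (sd_i / math.sqrt(n_i))
    return t, df


print(group_t_statistic(12.0, 4.0, 16, 10.0))
print(group_t_statistic(12.0, 0.0, 16, 10.0))
print(group_t_statistic(12.0, 0.0, 1, 10.0))
```

Handling the degenerate cases before the division keeps the function from raising a division-by-zero error on constant groups.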


Graphical Display of Significance 


Since significance values are often very small numbers, the negative common logarithm 
($-\log_{10}$) of the significance values is displayed instead in the bar charts. 


ALSCAL attempts to find the structure in a set of distance measures between objects or cases. 


Initial Configuration 


The first step is to estimate an additive constant $c_k$, which is added to the observed proximity 
measures (for example, $o_{ijk}$). Thus, 

$$o_{ijk}^{*} = o_{ijk} + c_k$$

such that for all triples the triangular inequality holds: 

$$o_{ijk}^{*} + o_{jlk}^{*} \ge o_{ilk}^{*}$$

and positivity holds: 

$$o_{ijk}^{*} \ge 0$$

where 

Table 5-1 
Notation 

Notation          Description 
$o_{ijk}^{*}$     is the adjusted proximity between stimulus i and stimulus j for subject k 
$o_{jlk}^{*}$     is the adjusted proximity between stimulus j and stimulus l for subject k 
$o_{ilk}^{*}$     is the adjusted proximity between stimulus i and stimulus l for subject k 


The constant $c_k$, which is added, is as small as possible to estimate a zero point for the dissimilarity 
data, thus bringing the data more nearly in line with the ratio level of measurement. This step 
is necessary to make the $B_k^{*}$ matrix, described below, positive semidefinite (that is, with no 
imaginary roots). 


The next step is to compute a scalar product matrix $B_k^{**}$ for each subject k by double centering 
$O_k^{*}$, the adjusted proximity matrix for each subject. An element of the $B_k^{**}$ matrix, $b_{ijk}^{**}$, is 
computed as follows: 

$$b_{ijk}^{**} = -\frac{1}{2} \left( o_{ijk}^{*2} - o_{i \cdot k}^{*2} - o_{\cdot jk}^{*2} + o_{\cdot \cdot k}^{*2} \right)$$

where 

Table 5-2 
Notation 

Notation                  Description 
$o_{i \cdot k}^{*2}$      are the row means for the adjusted proximities for subject k 
$o_{\cdot jk}^{*2}$       are the column means for the adjusted proximities for subject k 
$o_{\cdot \cdot k}^{*2}$  is the grand mean for subject k 


Double centering to convert distances to scalar products is necessary because a scalar products 
matrix is required to compute an initial configuration using the Young-Householder-Torgerson 
procedure. 
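Double centering can be illustrated with a short sketch. It assumes the means are taken over the squared adjusted proximities, as in the Torgerson construction; the function name and the three-point example are made up.

```python
def double_center(o):
    """Convert a symmetric matrix of proximities into scalar products."""
    n = len(o)
    sq = [[v * v for v in row] for row in o]
    row_mean = [sum(row) / n for row in sq]
    col_mean = [sum(sq[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row_mean) / n
    return [
        [-0.5 * (sq[i][j] - row_mean[i] - col_mean[j] + grand) for j in range(n)]
        for i in range(n)
    ]


# Distances among three points placed on a line at 0, 1 and 3.
b = double_center([[0.0, 1.0, 3.0], [1.0, 0.0, 2.0], [3.0, 2.0, 0.0]])
print(b)
```

For exact Euclidean distances the result equals the matrix of inner products of the centered coordinates (here −4/3, −1/3 and 5/3), and every row sums to zero.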


Next the individual subject matrices are normalized so that they have the same variance. The 
normalized matrix $B_k^{*}$ is found for each subject. The elements of the matrix are 

$$b_{ijk}^{*} = \frac{b_{ijk}^{**}}{\left[ \sum_i \sum_j \left( b_{ijk}^{**} \right)^2 / n(n-1) \right]^{1/2}}$$

where n is the number of stimuli, and $n(n - 1)$ is the number of off-diagonal elements in the 
$B_k^{**}$ matrix. The denominator is both the root mean square and the standard deviation of the 
unnormalized scalar products matrix $B_k^{**}$. (It is both because $b_{\cdot \cdot k}^{**} = 0$, due to double centering.) 
$B_k^{*}$ is thus a matrix with elements $b_{ijk}^{*}$, which are scalar products for individual subject k. 
Normalization of individual subjects’ matrices equates the contribution of each individual to the 
formation of a mean scalar products matrix and thus the resulting initial configuration. 


Next an average scalar products matrix $B^{*}$ over the subjects is computed. The elements of this 
matrix are 

$$b_{ij}^{*} = \frac{\sum_k b_{ijk}^{*}}{m}$$

where m is the number of subjects. 


The average $B^{*}$ matrix computed in the previous step is used to compute an initial stimulus 
configuration using the classical Young-Householder multidimensional scaling procedure 

$$B^{*} = XX'$$

where X is an n × r matrix of n stimulus points on r dimensions, and $X'$ is the transpose of the X 
matrix; that is, the rows and columns are interchanged. The X matrix is the initial configuration. 


For the weighted ALSCAL matrix model, initial weight configuration matrices $W_k$ for each of 
the m subjects are computed. The initial weight matrices $W_k$ are r × r matrices, where r is the 
number of dimensions. Later the diagonals of $W_k$ will form rows of the W matrix, which is an 
m × r matrix. The matrices $W_k$ are determined such that $B_k^{*} = Y W_k Y'$, where $Y = XT$ and 
$TT' = I$, and where T is an orthogonal rotation of the configuration X to a new orientation Y. T is 
computed by the Schönemann-de Leeuw procedure discussed by Young, Takane, and Lewyckyj 
(Young, Takane, and Lewyckyj, 1978). T rotates X so that $W_k$ is as diagonal as possible (that is, 
off-diagonal elements are as close to zero as possible on the average over subjects). Off-diagonal 
elements represent a departure from the model (which assumes that subjects weight only the 
dimensions of the stimulus space). 


Optimization Algorithm 


The optimization algorithm is a series of steps which are repeated (iterated) until the final solution 
is achieved. The steps for the optimization algorithm are performed successively because 
disparities, weights, and coordinates cannot be solved for simultaneously. 


Distance 


Distances are computed according to the weighted Euclidean model 

$$d_{ijk}^2 = \sum_{a=1}^{r} w_{ka} (x_{ia} - x_{ja})^2$$

where 

Table 5-3 
Notation 

Notation     Description 
$w_{ka}$     is the weight for subject k on a dimension a, 
$x_{ia}$     is the coordinate of stimulus i on dimension a, 
$x_{ja}$     is the coordinate of stimulus j on dimension a. 


The first set of distances is computed from the coordinates and weights found in the previous 
steps. Subsequently, new distances are computed from new coordinates found in the iteration 
process (described below). 
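The weighted Euclidean model amounts to a weighted sum of squared coordinate differences. A minimal sketch, with a made-up function name and coordinates:

```python
def squared_distance(x_i, x_j, w_k):
    """Squared weighted Euclidean distance between two stimuli for one subject."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(w_k, x_i, x_j))


# Two stimuli in two dimensions, seen by two hypothetical subjects.
x_i, x_j = [0.0, 0.0], [3.0, 4.0]
print(squared_distance(x_i, x_j, [1.0, 1.0]))  # unit weights: ordinary squared distance
print(squared_distance(x_i, x_j, [2.0, 0.5]))  # subject stresses dimension 1
```

With unit weights the model reduces to the simple (unweighted) Euclidean model.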


Optimal Scaling 


Optimal scaling for ordinal data uses Kruskal’s least-squares monotonic transformation. This yields 
disparities that are a monotonic transformation of the data and that are as much like the distances 
(in a least squares sense) as possible. Ideally, we want the distances to be in exactly the same rank 
order as the data, but usually they are not. So we locate a new set of numbers, called disparities, 
which are in the same rank order as the data and which fit the distances as well as possible. When 
we see an order violation we replace the numbers that are out of order with a block of values that 
are the mean of the out-of-order numbers. When there are ties in the data, the optimal scaling 
process is changed somewhat. Kruskal’s primary and secondary procedures are used in ALSCAL. 
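The block-averaging idea described above is the pool-adjacent-violators scheme. The sketch below shows it for data without ties; it is an illustration of the general technique, not the ALSCAL routine, and the treatment of ties (Kruskal's primary and secondary procedures) is left out. The function name is made up.

```python
def monotone_regression(distances):
    """Least-squares nondecreasing fit to distances listed in data rank order."""
    blocks = []  # each block is [sum of values, number of values]
    for d in distances:
        blocks.append([d, 1])
        # Merge backwards while the last two block means are out of order.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    disparities = []
    for s, c in blocks:
        disparities.extend([s / c] * c)
    return disparities


# Distances ordered by the rank of the corresponding data values.
print(monotone_regression([1.0, 3.0, 2.0, 4.0]))
```

In the example, the second and third distances violate the rank order, so both are replaced by their mean, 2.5, while the other two values are left unchanged.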


Normalization 


The disparities computed previously are now normalized for technical reasons related to the 
alternating least squares algorithm (Takane, Young, and de Leeuw, 1977). During the course of 
the optimization process, we want to minimize a measure of error called SSTRESS. But the 
monotone regression procedure described above only minimizes the numerator of the SSTRESS 
formula. Thus, the formula below is applied to readjust the length of the disparities vector so 
that SSTRESS is minimized: 

$$D_k^{**} = D_k^{*} \left( D_k' D_k \right) \left( D_k' D_k^{*} \right)^{-1}$$

where 

Table 5-4 
Notation 

Notation            Description 
$D_k^{*}$           is a column vector with $n(n-1)/2$ elements containing all the disparities for subject k, 
$D_k$               is a column vector with $n(n-1)/2$ elements containing all the distances for subject k, 
$D_k' D_k$          is the sum of the squared distances, 
$D_k' D_k^{*}$      is the sum of the cross products. 


The normalized disparities vector $D_k^{**}$ is a conditional least squares estimate for the distances; 
that is, it is the least squares estimate for a given iteration. The previous $D^{*}$ values are replaced by 
$D^{**}$ values, and subsequent steps utilize the normalized disparities. 
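The rescaling can be sketched as a single scalar multiplication. The sketch assumes that the cross-product term enters as a reciprocal, so that the scale factor is (sum of squared distances) / (sum of cross products); this reading, the function name, and the numbers are assumptions made for the example.

```python
def normalize_disparities(disparities, distances):
    """Rescale one subject's disparities by (D'D) / (D'D*)."""
    dd = sum(d * d for d in distances)
    cross = sum(d * s for d, s in zip(distances, disparities))
    scale = dd / cross
    return [scale * s for s in disparities]


distances = [2.0, 4.0, 7.0]
disparities = [1.0, 2.0, 3.0]
normalized = normalize_disparities(disparities, distances)
print(normalized)
```

Under this reading, the cross product of the distances with the normalized disparities equals the sum of squared distances, and the rank order of the disparities is untouched because only their length changes.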


SSTRESS 


The Takane-Young-de Leeuw formula is used: 

$$SSTRESS(1) = S = \left[ \frac{1}{m} \sum_{k=1}^{m} \left( \frac{\sum_i \sum_j \left( d_{ijk}^{*2} - d_{ijk}^{2} \right)^2}{\sum_i \sum_j d_{ijk}^{*4}} \right) \right]^{1/2}$$

where $d_{ijk}^{*}$ values are the normalized disparity measures computed previously, and $d_{ijk}$ are 
computed as shown above. Thus SSTRESS is computed from the normalized disparities and 
the previous set of coordinates and weights. 
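A sketch of the SSTRESS computation follows. It assumes the form in which, for each subject, the squared differences between squared disparities and squared distances are divided by the sum of fourth powers of the disparities, and the square root is taken of the average over subjects. The function name and data layout are made up.

```python
import math


def sstress(disparities_sq, distances_sq):
    """SSTRESS from squared disparities and squared distances.

    Each argument is a list over subjects; each entry is a list with one
    squared value per stimulus pair.
    """
    m = len(disparities_sq)
    total = 0.0
    for dstar, d in zip(disparities_sq, distances_sq):
        num = sum((a - b) ** 2 for a, b in zip(dstar, d))
        den = sum(a * a for a in dstar)
        total += num / den
    return math.sqrt(total / m)


print(sstress([[1.0, 2.0]], [[1.0, 1.0]]))
print(sstress([[1.0, 4.0], [9.0, 16.0]], [[1.0, 4.0], [9.0, 16.0]]))
```

A perfect fit gives zero; dividing within each subject keeps subjects with large proximities from dominating the average.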


Termination 


The current value of SSTRESS is compared to the value of SSTRESS from the previous iteration. 
If the improvement is less than a specified value (default equals 0.001), iteration stops and the 
output stage has been reached. If not, the program proceeds to the next step. (This step is skipped 
on the first iteration.) 


Model Estimation 

In ALSCAL the weights and coordinates cannot be solved for simultaneously, so we do it 
successively. Thus, the model estimation phase consists of two steps: (i) estimation of subject 
weights and (ii) estimation of stimulus coordinates. 


(i) Estimation of subject weights. (This step is skipped for the simple, that is, 
unweighted, Euclidean model.) 


A conditional least-squares estimate of the weights is computed at each iteration: 

$$W = D^{*} P (P'P)^{-1}$$

The derivation of the computational formula is as follows: 

We have found disparities such that 

$$d_{ijk}^{*2} \cong d_{ijk}^{2}$$

where 

$$d_{ijk}^{2} = \sum_{a=1}^{r} w_{ka} (x_{ia} - x_{ja})^2$$

Let $p_{ija}$ be the unweighted distance between stimuli i and j as projected onto dimension a, that is, 

$$p_{ija} = (x_{ia} - x_{ja})^2$$

Then 

$$d_{ijk}^{*2} \cong d_{ijk}^{2} = \sum_{a=1}^{r} w_{ka} p_{ija}$$

In matrix notation, this is expressed as $D^{*} = WP'$, where $D^{*}$ is now an m × $n(n-1)/2$ matrix 
having one row for every subject and one column for each stimulus pair; W is an m × r matrix 
having one row for every subject and one column for each dimension; and $P'$ has one row for 
every dimension and one column for every stimulus pair. 

We wish to solve for W, $WP' = D^{*}$, which we do by noting that 

$$WP'P(P'P)^{-1} = D^{*}P(P'P)^{-1}$$

Therefore, 

$$W = D^{*} P (P'P)^{-1}$$

and we have the conditional least squares estimate for W. We have in fact minimized SSTRESS at 
this point relative to the previously computed values for stimulus coordinates and optimal scaling. 
We replace the old subject weights with the newly estimated values. 


(ii) Estimation of Stimulus Coordinates. The next step is to estimate coordinates, one at a time, 
using the previously computed values for $D^{*}$ (disparities) and weights. Coordinates are 
determined one at a time by minimizing SSTRESS with regard to a given coordinate. Equation 
(2) allows us to solve for a given coordinate $x_{le}$: 


Ori. mack On}, 


as _ 15, 25k a) 


OS; __ 2 ms) eres Dm. m2 ae Di, ee 
oo = du a (a le ~ SLieLje + 2L eI 7 DTK le + bine je) (2) 
j 


Equation (2) can be substituted back into equation (1). This equation with one unknown, $x_{le}$, is 
then set equal to zero and solved by standard techniques. All the other coordinates except $x_{le}$ are 
assumed to be constant while we solve for $x_{le}$. 


Immediately upon solving for $x_{le}$, we replace the value for $x_{le}$ used on the previous iteration with 
the newly obtained value, and then proceed to estimate the value for another coordinate. We 
successively obtain values for each coordinate of point l, one at a time, replacing old values with 
new ones. This continues for point l until the estimates stabilize. We then move to a new point and 
proceed until new coordinates for all stimuli are estimated. We then return to the beginning of the 
optimization algorithm (the previous step above) and start another iteration. 


ANACOR Algorithms 


The ANACOR algorithm consists of three major parts: 
1. A singular value decomposition (SVD) 
2. Centering and rescaling of the data and various rescalings of the results 


3. Variance estimation by the delta method. 


Other names for SVD are “Eckart-Young decomposition” after Eckart and Young (1936), who 
introduced the technique in psychometrics, and “basic structure” (Horst, 1963). The rescalings 
and centering, including their rationale, are well explained in Benzécri (1969), Nishisato (1980), 
Gifi (1981), and Greenacre (1984). Those who are interested in the general framework of matrix 
approximation and reduction of dimensionality with positive definite row and column metrics 
are referred to Rao (1980). The delta method is a method that can be used for the derivation 

of asymptotic distributions and is particularly useful for the approximation of the variance of 
complex statistics. There are many versions of the delta method, differing in the assumptions 
made and in the strength of the approximation (Rao, 1973, ch. 6; Bishop et al., 1975, ch. 14; 
Wolter, 1985, ch. 6). 


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


$k_1$       Number of rows (row objects) 
$k_2$       Number of columns (column objects) 
$p$         Number of dimensions 


Data-Related Quantities 

$f_{ij}$    Nonnegative data value for row i and column j: collected in table F 
$f_{i+}$    Marginal total of row i, $i = 1, \ldots, k_1$ 
$f_{+j}$    Marginal total of column j, $j = 1, \ldots, k_2$ 
$N$         Grand total of F 


Scores and Statistics 

$r_{is}$    Score of row object i on dimension s 
$c_{js}$    Score of column object j on dimension s 
$I$         Total inertia 


Basic Calculations 


One way to phrase the ANACOR objective (cf. Heiser, 1981) is to say that we wish to find row 
scores $\{r_{is}\}$ and column scores $\{c_{js}\}$ so that the function 

$$\sigma(\{r_{is}\}; \{c_{js}\}) = \sum_i \sum_j f_{ij} \sum_s (r_{is} - c_{js})^2$$

is minimal, under the standardization restriction either that 

$$\sum_i f_{i+} r_{is} r_{it} = \delta^{st}$$

or 

$$\sum_j f_{+j} c_{js} c_{jt} = \delta^{st}$$

where $\delta^{st}$ is Kronecker’s delta and t is an alternative index for dimensions. The trivial set of 
scores ({1},{1}) is excluded. 
The ANACOR algorithm can be subdivided into five steps, as explained below. 


Data scaling and centering 


The first step is to form the auxiliary matrix Z with general element 

$$z_{ij} = \frac{f_{ij}}{\sqrt{f_{i+} f_{+j}}} - \frac{\sqrt{f_{i+} f_{+j}}}{N}$$


Singular value decomposition 

Let the singular value decomposition of Z be denoted by 

$$Z = K \Lambda L'$$

with $K'K = I$, $L'L = I$, and $\Lambda$ diagonal. This decomposition is calculated by a routine based 
on Golub and Reinsch (1971). It involves Householder reduction to bidiagonal form and 
diagonalization by a QR procedure with shifts. The routine requires an array with more rows 
than columns, so when $k_1 < k_2$ the original table is transposed and the parameter transfer is 
permuted accordingly. 


Adjustment to the row and column metric 


The arrays of both the left-hand singular vectors and the right-hand singular vectors are adjusted 
row-wise to form scores that are standardized in the row and in the column marginal proportions, 
respectively: 

$$r_{is} = k_{is} \Big/ \sqrt{f_{i+} / N},$$

$$c_{js} = l_{js} \Big/ \sqrt{f_{+j} / N}.$$

This way, both sets of scores satisfy the standardization restrictions simultaneously. 


Determination of variances and covariances 


For the application of the delta method to the results of generalized eigenvalue methods under 
multinomial sampling, the reader is referred to Gifi (1981, ch. 12) and Israëls (1987, Appendix 
B). It is shown there that N times the variance-covariance matrix of a function $\phi$ of the observed cell 
proportions $p = \{p_{ij} = f_{ij}/N\}$ asymptotically reaches the form 

$$N \times \mathrm{cov}(\phi(p)) = \sum_i \sum_j \pi_{ij} \left( \frac{\partial \phi}{\partial p_{ij}} \right) \left( \frac{\partial \phi}{\partial p_{ij}} \right)' - \left( \sum_i \sum_j \pi_{ij} \frac{\partial \phi}{\partial p_{ij}} \right) \left( \sum_i \sum_j \pi_{ij} \frac{\partial \phi}{\partial p_{ij}} \right)'$$

Here the quantities $\pi_{ij}$ are the cell probabilities of the multinomial distribution, and $\partial \phi / \partial p_{ij}$ are 
the partial derivatives of $\phi$ (which is either a generalized eigenvalue or a generalized eigenvector) 
with respect to the observed cell proportion. Expressions for these partial derivatives can also 
be found in the above-mentioned references. 


Normalization of row and column scores 
Depending on the normalization option chosen, the scores are normalized, which implies a 
compensatory rescaling of the coordinate axes of the row scores and the column scores. The 
general formula for the weighted sum of squares that results from this rescaling is 

row scores: $\sum_i f_{i+} r_{is}^2 = N \lambda_s^{(1+q)}$ 

column scores: $\sum_j f_{+j} c_{js}^2 = N \lambda_s^{(1-q)}$ 

The parameter q can be chosen freely or it can be specified according to the following designations: 

$$q = \begin{cases} 0, & \text{canonical} \\ 1, & \text{row principal} \\ -1, & \text{column principal} \end{cases}$$

There is a fifth possibility, choosing the designation “principal,” that does not fit into this scheme. 
It implies that the weighted sum of squares of both sets of scores becomes equal to $N \lambda_s^2$. The 
estimated variances and covariances are adjusted according to the type of normalization chosen. 


Diagnostics 


After printing the data, ANACOR optionally also prints a table of row profiles and column 
profiles, which are $\{f_{ij}/f_{i+}\}$ and $\{f_{ij}/f_{+j}\}$, respectively. 


Singular Values, Maximum Rank and Inertia 


All singular values $\lambda_s$ defined in step 2 are printed up to a maximum of $\min\{(k_1 - 1), (k_2 - 1)\}$. 
Small singular values and corresponding dimensions are suppressed when they don’t exceed the 
quantity $(k_1 k_2)^{1/2} 10^{-7}$; in this case a warning message is issued. Dimensionwise inertia and total 
inertia are given by the relationships 

$$I = \sum_s \lambda_s^2 = \sum_s \sum_i \frac{f_{i+}}{N} r_{is}^2$$

where the right-hand part of this equality is true only if the normalization is row principal (but 
for the other normalizations similar relationships are easily derived from “Normalization of 
row and column scores”). The quantities “proportion explained” are equal to inertia divided 
by total inertia: $\lambda_s^2 / I$. 


Scores and Contributions 


This output is given first for rows, then for columns, and always preceded by a column of marginal 
proportions ($f_{i+}/N$ and $f_{+j}/N$, respectively). The table of scores is printed in p dimensions. The 
contribution to the inertia of each dimension is given by 

$$\tau_{is} = \frac{f_{i+}}{N} \frac{r_{is}^2}{\lambda_s^2}$$

$$\tau_{js} = \frac{f_{+j}}{N} c_{js}^2$$

The above formula is true only under the row principal normalization option. (For the other 
normalizations, similar relationships are again easily derived from “Normalization of row and 
column scores”.) The contribution of dimensions to the inertia of each point is given by, for 
$s, t = 1, \ldots, p$, 

$$\sigma_{is} = r_{is}^2 \Big/ \sum_t r_{it}^2$$

$$\sigma_{js} = \lambda_s^2 c_{js}^2 \Big/ \sum_t \lambda_t^2 c_{jt}^2$$



Variances and Correlation Matrix of Singular Values and Scores 


The computation of variances and covariances is explained in “Determination of variances and 
covariances”. Since the row and column scores are linear functions of the singular vectors, an 
adjustment is necessary depending on the normalization option chosen. From these adjusted 
variances and covariances the correlations are derived in the standard way. 


Permutations of the Input Table 


For each dimension s, let $\rho(i|s)$ be the permutation of the first $k_1$ integers that would sort the 
sth column of $\{r_{is}\}$ in ascending order. Similarly, let $\rho(j|s)$ be the permutation of the first 
$k_2$ integers that would sort the sth column of $\{c_{js}\}$ in ascending order. Then the permuted data 
matrix is given by $\{f_{\rho(i|s), \rho(j|s)}\}$. 
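The permutation step amounts to reordering the rows and columns of the table by their scores on one dimension. A small sketch, with made-up function names, scores, and table:

```python
def sort_permutation(scores):
    """Indices that would sort the scores in ascending order."""
    return sorted(range(len(scores)), key=lambda i: scores[i])


def permute_table(table, row_scores, col_scores):
    """Reorder rows and columns of the table by one dimension's scores."""
    rows = sort_permutation(row_scores)
    cols = sort_permutation(col_scores)
    return [[table[i][j] for j in cols] for i in rows]


F = [[1, 2, 3], [4, 5, 6]]
row_scores = [0.7, -0.7]       # hypothetical row scores on dimension s
col_scores = [0.2, -1.0, 0.8]  # hypothetical column scores on dimension s
permuted = permute_table(F, row_scores, col_scores)
print(permuted)  # [[5, 4, 6], [2, 1, 3]]
```

The sketch uses zero-based indices, whereas the text numbers rows and columns from 1.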
ANOVA Algorithms 


This chapter describes the algorithms used by the ANOVA procedure. 
Model and Matrix Computations 


Notation 


The following notation is used throughout this section unless otherwise stated. 


Table 7-1 
Notation 

Notation     Description 
$N$          Number of cases 
$F$          Number of factors 
$CN$         Number of covariates 
$k_i$        Number of levels of factor i 
$Y_k$        Value of the dependent variable for case k 
$Z_{jk}$     Value of the jth covariate for case k 
$w_k$        Weight for case k 
$W$          Sum of weights of all cases 

The Model 


A linear model with covariates can be written in matrix notation as 

$$Y = X\beta + ZC + e \qquad (1)$$

where 

Table 7-2 
Notation 

Notation     Description 
$Y$          N × 1 vector of values of the dependent variable 
$X$          Design matrix (N × p) of rank q < p 
$\beta$      Vector of parameters (p × 1) 
$Z$          Matrix of covariates (N × CN) 
$C$          Vector of covariate coefficients (CN × 1) 
$e$          Vector of error terms (N × 1) 


Constraints 


To reparametrize equation (1) to a full rank model, a set of non-estimable conditions is needed. 
The constraint imposed on non-regression models is that all parameters involving level 1 of 
any factor are set to zero. 


For the regression model, the constraints are that the analysis of variance parameter estimates for 
each main effect and each order of interactions sum to zero. The interaction must also sum to 
zero over each level of subscripts. 


For a standard two-way ANOVA model with the main effects $\alpha_i$ and $\beta_j$, and interaction parameter 
$\gamma_{ij}$, the constraints can be expressed as 

$$\alpha_1 = \beta_1 = \gamma_{1j} = \gamma_{i1} = 0 \quad \text{(non-regression)}$$

$$\alpha_\bullet = \beta_\bullet = \gamma_{i\bullet} = \gamma_{\bullet j} = 0 \quad \text{(regression)}$$

where $\bullet$ indicates summation. 


Computation of Matrices 

$X'X$ 

Non-regression Model 


The $X'X$ matrix contains the sum of weights of the cases that contain a particular combination of 
parameters. All parameters that involve level 1 of any of the factors are excluded from the matrix. 
For a two-way design with $k_1 = 2$ and $k_2 = 3$, the symmetric matrix would look like the following: 

$$\begin{array}{c|ccccc}
 & \alpha_2 & \beta_2 & \beta_3 & \gamma_{22} & \gamma_{23} \\
\hline
\alpha_2 & N_{2\bullet} & N_{22} & N_{23} & N_{22} & N_{23} \\
\beta_2 & & N_{\bullet 2} & 0 & N_{22} & 0 \\
\beta_3 & & & N_{\bullet 3} & 0 & N_{23} \\
\gamma_{22} & & & & N_{22} & 0 \\
\gamma_{23} & & & & & N_{23}
\end{array}$$

The elements $N_{i\bullet}$ or $N_{\bullet j}$ on the diagonal are the sums of weights of cases that have level i of $\alpha$ or 
level j of $\beta$. Off-diagonal elements are sums of weights of cases cross-classified by parameter 
combinations. Thus, $N_{\bullet 3}$ is the sum of weights of cases in level 3 of main effect $\beta_3$, while $N_{22}$ is 
the sum of weights of cases with $\alpha_2$ and $\beta_2$. 


Regression Model 
A row of the design matrix X is formed for each case. The row is generated as follows:

If a case belongs to one of the 2 to $k_i$ levels of factor i, a code of 1 is placed in the column
corresponding to the level and 0 in all other $k_i - 1$ columns associated with factor i. If the case
belongs in the first level of factor i, -1 is placed in all the $k_i - 1$ columns associated with factor i.
This is repeated for each factor. The entries for the interaction terms are obtained as products of
the entries in the corresponding main effect columns. This vector of dummy variables for a case
will be denoted as $d(i), i = 1, \ldots, NC$, where NC is the number of columns in the reparametrized
design matrix. After the vector d is generated for case k, the ijth cell of X'X is incremented by
$d(i)\,d(j)\,w_k$, where $i = 1, \ldots, NC$ and $j \ge i$.
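
The row-generation rule for the regression model can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`main_effect_codes`, `design_row`, `accumulate_xtx`) are hypothetical, factor levels are assumed to be numbered 1 to $k_i$, and interaction columns are appended in a fixed order chosen here for convenience, which need not match the column order used by the procedure.

```python
from itertools import combinations, product


def main_effect_codes(level, k):
    """Codes for one factor with levels 1..k: k-1 columns, one per level 2..k."""
    if level == 1:
        return [-1] * (k - 1)          # first level: -1 in every column
    return [1 if level == j else 0 for j in range(2, k + 1)]


def design_row(levels, ks):
    """Build the dummy-variable vector d for one case.

    levels: level of each factor for this case; ks: number of levels per factor.
    Interaction entries are products of the corresponding main-effect entries.
    """
    mains = [main_effect_codes(lv, k) for lv, k in zip(levels, ks)]
    row = [c for codes in mains for c in codes]
    for order in range(2, len(ks) + 1):
        for combo in combinations(range(len(ks)), order):
            for cols in product(*(mains[i] for i in combo)):
                entry = 1
                for c in cols:
                    entry *= c
                row.append(entry)
    return row


def accumulate_xtx(xtx, d, w):
    """Add d(i) * d(j) * w to the upper triangle (j >= i) of X'X."""
    n = len(d)
    for i in range(n):
        for j in range(i, n):
            xtx[i][j] += d[i] * d[j] * w
```

For a 2 x 3 design, `design_row((2, 3), (2, 3))` gives one column for factor 1, two for factor 2, and two interaction columns; a case in the first level of both factors gets -1 in every main-effect column and +1 in the interaction columns.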


Checking and Adjustment for the Mean 


After all cases have been processed, the diagonal entries of X'X are examined. Rows and
columns corresponding to zero diagonals are deleted and the number of levels of a factor is
reduced accordingly. If a factor has only one level, the analysis will be terminated with a message.
If the first specified level of a factor is missing, the first non-empty level will be deleted from the
matrix for non-regression models. For regression designs, the first level cannot be missing. All
entries of X'X are subsequently adjusted for means.

The highest order of interactions in the model can be selected. This will affect the generation of
X'X. If none of these options is chosen, the program will generate the highest order of interactions
allowed by the number of factors. If sub-matrices corresponding to main effects or interactions in
the reparametrized model are not of full rank, a message is printed and the order of the model is
reduced accordingly.


Cross-Product Matrices for Continuous Variables 


Provisional means algorithms are used to compute the adjusted-for-the-means cross-product
matrices.

Matrix of Covariates Z'Z

The covariance of covariates m and l after case k has been processed is

$$Z'Z_{ml}(k) = Z'Z_{ml}(k-1) + \frac{w_k \left( W_k Z_{lk} - \sum_{j=1}^{k} w_j Z_{lj} \right) \left( W_k Z_{mk} - \sum_{j=1}^{k} w_j Z_{mj} \right)}{W_k W_{k-1}}$$

where $W_k$ is the sum of weights of the first k cases.
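
As an illustration of the provisional means update, the following Python sketch accumulates a mean-corrected weighted cross-product one case at a time. The function name `provisional_cross_product` is hypothetical, and the handling of the first case (no contribution, because $W_0 = 0$) is an assumption made here to avoid dividing by zero.

```python
def provisional_cross_product(x, y, w):
    """One-pass, mean-corrected weighted cross-product of x and y.

    After case k the running value changes by
        w_k * (W_k*x_k - sum_j w_j*x_j) * (W_k*y_k - sum_j w_j*y_j) / (W_k * W_{k-1}),
    where the sums run over the first k cases.
    """
    total_w = 0.0      # W_k, running sum of weights
    sum_wx = 0.0       # running sum of w_j * x_j
    sum_wy = 0.0       # running sum of w_j * y_j
    cross = 0.0        # corrected cross-product so far
    for xk, yk, wk in zip(x, y, w):
        prev_w = total_w
        total_w += wk
        sum_wx += wk * xk
        sum_wy += wk * yk
        if prev_w > 0:  # assumed: the first case contributes nothing
            cross += (wk * (total_w * xk - sum_wx) * (total_w * yk - sum_wy)
                      / (total_w * prev_w))
    return cross
```

Passing the same series for `x` and `y` gives a corrected sum of squares, so the same routine covers the diagonal entries as well.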


The Vector Z’Y 


The covariance between the mth covariate and the dependent variable after case k has been
processed is

$$Z'Y_{m}(k) = Z'Y_{m}(k-1) + \frac{w_k \left( W_k Y_k - \sum_{j=1}^{k} w_j Y_j \right) \left( W_k Z_{mk} - \sum_{j=1}^{k} w_j Z_{mj} \right)}{W_k W_{k-1}}$$


The Scalar Y’Y 


The corrected sum of squares for the dependent variable after case k has been processed is

$$Y'Y(k) = Y'Y(k-1) + \frac{w_k \left( W_k Y_k - \sum_{j=1}^{k} w_j Y_j \right)^2}{W_k W_{k-1}}$$


The Vector X’Y 


X'Y is a vector with NC rows. The ith element is

$$X'Y_i = \sum_{k=1}^{N} Y_k w_k \delta_k^{(i)}$$

where, for non-regression models, $\delta_k^{(i)} = 1$ if case k has the factor combination in column i
of X'X and $\delta_k^{(i)} = 0$ otherwise. For regression models, $\delta_k^{(i)} = d(i)$, where d(i) is the
dummy variable for column i of case k. The final entries are adjusted for the mean.


Matrix X’Z 


The (i, m)th entry is

$$X'Z_{im} = \sum_{k=1}^{N} Z_{mk} w_k \delta_k^{(i)}$$

where $\delta_k^{(i)}$ has been defined previously. The final entries are adjusted for the mean.


Computation of ANOVA Sum of Squares 


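The full rank model with covariates
$$Y = Xb + ZC + e$$
can also be expressed as

$$Y = X_k b_k + X_m b_m + ZC + e$$

where X and b are partitioned as
$$X = [X_k \mid X_m] \quad \text{and} \quad b = \begin{bmatrix} b_k \\ b_m \end{bmatrix}.$$

The normal equations are then

$$\begin{bmatrix} Z'Z & Z'X_k & Z'X_m \\ X_k'Z & X_k'X_k & X_k'X_m \\ X_m'Z & X_m'X_k & X_m'X_m \end{bmatrix} \begin{bmatrix} \hat{C} \\ \hat{b}_k \\ \hat{b}_m \end{bmatrix} = \begin{bmatrix} Z'Y \\ X_k'Y \\ X_m'Y \end{bmatrix} \qquad (2)$$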


The normal equations for any reduced model can be obtained by excluding those entries from 
equation (2) corresponding to terms that do not appear in the reduced model. 


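Thus, for the model excluding $b_m$,
$$Y = X_k b_k + ZC + e$$

the solution to the normal equation is:

$$\begin{bmatrix} \tilde{C} \\ \tilde{b}_k \end{bmatrix} = \begin{bmatrix} Z'Z & Z'X_k \\ X_k'Z & X_k'X_k \end{bmatrix}^{-1} \begin{bmatrix} Z'Y \\ X_k'Y \end{bmatrix} \qquad (3)$$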


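The sum of squares due to fitting the complete model (explained SS) is

$$R(C, b_k, b_m) = \left[ \hat{C}', \hat{b}_k', \hat{b}_m' \right] \begin{bmatrix} Z'Y \\ X_k'Y \\ X_m'Y \end{bmatrix} = \hat{C}'Z'Y + \hat{b}_k'X_k'Y + \hat{b}_m'X_m'Y$$

For the reduced model, it is

$$R(C, b_k) = \left[ \tilde{C}', \tilde{b}_k' \right] \begin{bmatrix} Z'Y \\ X_k'Y \end{bmatrix} = \tilde{C}'Z'Y + \tilde{b}_k'X_k'Y$$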


The residual (unexplained) sum of squares for the complete model is
$RSS = Y'Y - R(C, b_k, b_m)$, and similarly for the reduced model. The total sum
of squares is Y'Y. The reduction in the sum of squares due to including $b_m$ in a model that
already includes $b_k$ and C will be denoted as $R(b_m \mid C, b_k)$. This can also be expressed as

$$R(b_m \mid C, b_k) = R(C, b_k, b_m) - R(C, b_k)$$

There are several ways to compute $R(b_m \mid C, b_k)$. The sum of squares due to the full model, as
well as the sum of squares due to the reduced model, can each be calculated, and the difference
obtained (Method 1).

$$R(b_m \mid C, b_k) = \hat{C}'Z'Y + \hat{b}_k'X_k'Y + \hat{b}_m'X_m'Y - \tilde{C}'Z'Y - \tilde{b}_k'X_k'Y$$


A sometimes computationally more efficient procedure is to calculate

$$R(b_m \mid C, b_k) = \hat{b}_m' T_m^{-1} \hat{b}_m$$

where $\hat{b}_m$ are the estimates obtained from fitting the full model and $T_m$ is the partition of the
inverse matrix corresponding to $b_m$ (Method 2).

$$\begin{bmatrix} Z'Z & Z'X_k & Z'X_m \\ X_k'Z & X_k'X_k & X_k'X_m \\ X_m'Z & X_m'X_k & X_m'X_m \end{bmatrix}^{-1} = \begin{bmatrix} T_c & T_{ck} & T_{cm} \\ T_{kc} & T_k & T_{km} \\ T_{mc} & T_{mk} & T_m \end{bmatrix}$$

Model and Options 


Notation 


Let b be partitioned as

$$b = \begin{bmatrix} M \\ D \end{bmatrix} = \begin{bmatrix} m_1 \\ \vdots \\ m_F \\ d_1 \\ \vdots \\ d_{F-1} \end{bmatrix}$$

where

Table 7-3
Notation

Notation        Description
$M$             Vector of main effect coefficients
$m_i$           Vector of coefficients for main effect i
$M^{(i)}$       M excluding $m_i$
$M^{*i}$        M including only $m_1$ through $m_{i-1}$
$D$             Vector of interaction coefficients
$d_k$           Vector of kth order interaction coefficients
$d_{ki}$        Vector of coefficients for the ith of the kth order interactions
$D^{(k)}$       D excluding $d_k$
$D^{*k}$        D including only $d_1$ through $d_{k-1}$
$d_k^{(i)}$     $d_k$ excluding $d_{ki}$
$C$             Vector of covariate coefficients
$c_i$           Covariate coefficient
$C^{(i)}$       C excluding $c_i$
$C^{*i}$        C including only $c_1$ through $c_{i-1}$


Models 


Different types of sums of squares can be calculated in ANOVA. 


Sum of Squares for Type of Effects 


Table 7-4
Sum of squares for type of effect

Type                             Covariates        Main Effects      Interactions
Experimental and Hierarchical    $R(C)$            $R(M \mid C)$     $R(d_k \mid C, M, D^{*k})$
Covariates with Main Effects     $R(C, M)$         $R(C, M)$         $R(d_k \mid C, M, D^{*k})$
Covariates after Main Effects    $R(C \mid M)$     $R(M)$            $R(d_k \mid C, M, D^{*k})$
Regression                       $R(C \mid M, D)$  $R(M \mid C, D)$  $R(d_k \mid C, M, D^{*k})$


All sums of squares are calculated as described in the introduction. Reductions in sums of squares 
(R(A|B)) are computed using Method 1. Since all cross-product matrices have been corrected for 
the mean, all sums of squares are adjusted for the mean. 


Sum of Squares Within Effects 


Table 7-5 
Sum of squares within effects 


Type Covariates Main Effects Interactions 
Default Experimental R(alc) R(m,|C,M") R(di,|C,M, D**,4\”) 
Covariates with Main R(c IM, cl ') R(m CM" ) same as default 
Effects me oe 

Covariates after Main LIN (i) (i) same as default 
anh R(ci|M,C) R(m,|M") 

Regression R(ciJM,C™,D) R(m|M”,C,D) R(dx, C,M,D%)) 
Hierarchical R(c:|C'*) R(m;|C,M'"*) same as default 
Hierarchical and Covariates R(c:|C'*,M) R(m;|M' ‘) same as default 
with Main Effects or 

Hierarchical and Covariates 

after Main Effects 
Reductions in sums of squares are calculated using Method 2, except for specifications involving
the Hierarchical approach. For these, Method 1 is used. All sums of squares are adjusted for 
the mean. 


Degrees of Freedom 


Main Effects

$$df_M = \sum_{i=1}^{F} (k_i - 1)$$

Main Effect i

$$(k_i - 1)$$

Covariates

$$df_C = CN$$

Covariate i

1

Interactions

Interactions $d_r$:
$df_r$ = number of linearly independent columns corresponding to interaction $d_r$ in X'X

Interactions $d_{ri}$:
df = number of independent columns corresponding to interaction $d_{ri}$ in X'X

Model

$$df_{Model} = df_M + df_C + \sum_{r=1}^{F-1} df_r$$

Residual

$$W - 1 - df_{Model}$$

Total

$$W - 1$$
Multiple Classification Analysis

Notation

Table 7-6
Notation

Notation     Description
$Y_{ijk}$    Value of the dependent variable for the kth case in level j of main effect i
$n_{ij}$     Sum of weights of observations in level j of main effect i
$k_i$        Number of nonempty levels in the ith main effect
$W$          Sum of weights of all observations


Basic Computations 


Mean of Dependent Variable in Level j of Main Effect i

$$\bar{Y}_{ij} = \sum_{k=1}^{n_{ij}} Y_{ijk} / n_{ij}$$

Grand Mean

$$\bar{Y} = \sum_{j} \sum_{k} Y_{ijk} / W$$


Coefficient Estimates 


The coefficients for the main-effects-only model ($b_{ij}$) and the coefficients for the model with
main effects and covariates only ($\tilde{b}_{ij}$) are obtained as previously described.

Calculation of the MCA Statistics (Andrews, et al., 1973)


Deviations 


For each level of each main effect, the following are computed: 


Unadjusted Deviations 


The unadjusted deviation from the grand mean for the jth level of the ith factor:

$$m_{ij} = \bar{Y}_{ij} - \bar{Y}$$

Deviations Adjusted for the Main Effects

$$m_{ij}^{1} = b_{ij} - \sum_{j=2}^{k_i} b_{ij} n_{ij} / W, \quad \text{where } b_{i1} = 0.$$

Deviations Adjusted for the Main Effects and Covariates (Only for Models with
Covariates)

$$m_{ij}^{2} = \tilde{b}_{ij} - \sum_{j=2}^{k_i} \tilde{b}_{ij} n_{ij} / W, \quad \text{where } \tilde{b}_{i1} = 0.$$


ETA and Beta Coefficients 


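For each main effect i, the following are computed:

$$ETA_i = \sqrt{ \sum_{j} n_{ij} \left( \bar{Y}_{ij} - \bar{Y} \right)^2 / Y'Y }$$

Beta Adjusted for Main Effects

$$Beta_i = \sqrt{ \sum_{j=2}^{k_i} n_{ij} \left( m_{ij}^{1} \right)^2 / Y'Y }$$

Beta Adjusted for Main Effects and Covariates

$$Beta_i = \sqrt{ \sum_{j=2}^{k_i} n_{ij} \left( m_{ij}^{2} \right)^2 / Y'Y }$$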
Squared Multiple Correlation Coefficients

Main effects model

$$R_m^2 = \frac{R(M)}{Y'Y}$$

Main effects and covariates model

$$R_{mc}^2 = \frac{R(M, C)}{Y'Y}$$

The computations of R(M), R(M,C), and Y'Y are outlined previously.


Unstandardized Regression Coefficients for Covariates 


Estimates for the C vector, which are obtained the first time covariates are entered into the model, 
are printed. 


Cell Means and Sample Sizes 


Cell means and sample sizes for each combination of factor levels are obtained from the X'Y and
X'X matrices prior to correction for the mean.


Means for combinations involving the first level of a factor are obtained by subtraction from 
marginal totals. 


Matrix Inversion 


The Cholesky decomposition (Stewart, 1973) is used to triangularize the matrix. If the tolerance is 
less than 10~°, the matrix is considered singular. 
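
A minimal sketch of triangularizing a symmetric matrix by Cholesky decomposition with a singularity check is shown below. It is an illustration only: the function name `cholesky_lower` is hypothetical, the rule used here (a pivot that is small relative to the original diagonal element counts as singular) is an assumed definition of "tolerance", and the default `tol` is a placeholder rather than the threshold used by the procedure.

```python
import math


def cholesky_lower(a, tol=1e-10):
    """Return the lower-triangular factor L with a = L * L^T.

    a is a symmetric matrix given as a list of rows. A ValueError is raised when
    a pivot, relative to the original diagonal entry, falls below tol
    (an assumed singularity rule; tol is a placeholder value).
    """
    n = len(a)
    lower = [[0.0] * n for _ in range(n)]
    for j in range(n):
        pivot = a[j][j] - sum(lower[j][k] ** 2 for k in range(j))
        if a[j][j] <= 0 or pivot / a[j][j] < tol:
            raise ValueError("matrix is considered singular at column %d" % j)
        lower[j][j] = math.sqrt(pivot)
        for i in range(j + 1, n):
            s = a[i][j] - sum(lower[i][k] * lower[j][k] for k in range(j))
            lower[i][j] = s / lower[j][j]
    return lower
```

Once the factor is available, the inverse (and hence the partitions used in Method 2) can be obtained by forward and back substitution.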
In the ordinary regression model the errors are assumed to be uncorrelated. The model considered
here has the form

$$y_t = a + \sum_{i=1}^{p} b_i x_{ti} + u_t, \quad t = 1, \ldots, n \qquad (1)$$
$$u_t = \rho u_{t-1} + \varepsilon_t$$

where $\varepsilon_t$ is an uncorrelated random error with variance $\sigma^2$ and zero mean. The error
terms $u_t$ follow a first-order autoregressive process. The constant term a can be included or excluded
as specified. In the discussion below, if a is not included, it is set to be zero and not involved in
the subsequent computation.


Two computational methods—Prais-Winsten and Cochrane-Orcutt—are described here. 


Cochrane-Orcutt Method 


Note that model (1) can be rewritten in two equivalent forms as:

$$y_t - \rho y_{t-1} = a(1 - \rho) + \sum_{i=1}^{p} b_i \left( x_{ti} - \rho x_{(t-1)i} \right) + \varepsilon_t \qquad (2)$$

$$y_t - a - \sum_{i=1}^{p} b_i x_{ti} = \rho \left( y_{t-1} - a - \sum_{i=1}^{p} b_i x_{(t-1)i} \right) + \varepsilon_t \qquad (3)$$

Defining $y_t^* = y_t - \rho y_{t-1}$ and $x_{ti}^* = x_{ti} - \rho x_{(t-1)i}$ for t = 2, ..., n, equation (2)
can be rewritten as

$$y_t^* = a(1 - \rho) + \sum_{i=1}^{p} b_i x_{ti}^* + \varepsilon_t \qquad (2^*)$$

Starting with an initial value for $\rho$, the differences $y_t^*$ and $x_{ti}^*$ in equation (2*) are computed and
OLS is then applied to equation (2*) to estimate a and $b_i$. These estimates in turn can be used in
equation (3) to update $\rho$ and the standard error of the estimate.
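
The alternation between the differenced regression and the update of the autoregressive parameter can be sketched for the single-predictor case as follows. This is a simplified illustration, not the procedure's implementation: the function names are hypothetical, only one predictor and a constant term are handled, and the rho update is assumed to be the usual least-squares regression of each residual on its predecessor through the origin. The stopping defaults (parameter change below 0.001, at most 10 iterations) follow the defaults stated for this method.

```python
def ols_simple(x, y):
    """Ordinary least squares for y = intercept + slope * x; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def update_rho(u):
    """Assumed update: regress u_t on u_{t-1} through the origin, t = 2..n."""
    num = sum(u[t] * u[t - 1] for t in range(1, len(u)))
    den = sum(u[t - 1] ** 2 for t in range(1, len(u)))
    return num / den if den > 0 else 0.0


def cochrane_orcutt(x, y, rho=0.0, tol=0.001, max_iter=10):
    """Iterate OLS on the differenced data and the rho update; returns (a, b, rho).

    Assumes rho stays away from 1 so that a = intercept* / (1 - rho) is defined.
    """
    a = b = 0.0
    for _ in range(max_iter):
        # Differenced series for t = 2..n
        x_star = [x[t] - rho * x[t - 1] for t in range(1, len(x))]
        y_star = [y[t] - rho * y[t - 1] for t in range(1, len(y))]
        intercept_star, b_new = ols_simple(x_star, y_star)
        a_new = intercept_star / (1.0 - rho)   # intercept* estimates a * (1 - rho)
        # Residuals from the original model, used to update rho
        u = [yi - a_new - b_new * xi for xi, yi in zip(x, y)]
        rho_new = update_rho(u)
        change = max(abs(a_new - a), abs(b_new - b), abs(rho_new - rho))
        a, b, rho = a_new, b_new, rho_new
        if change < tol:
            break
    return a, b, rho
```

With several predictors the same loop applies, with `ols_simple` replaced by a multiple-regression solver.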


Initial Results 


An initial value for $\rho$ can be pre-set by the user or set to be zero by default. The OLS method is
used to obtain an initial estimate for a (if the constant term is included) and $b_i$.


ANOVA 


Based on the OLS results, an analysis of variance table is constructed in which the degrees of
freedom for regression are p, the number of X variables in equation (1), while the degrees of
freedom for the residual are $n - p^* - 1$ if the initial $\rho \ne 0$ and are $n - p^*$ otherwise. $p^*$ is the
number of coefficients in equation (1). The sums of squares, mean squares, and other statistics are
computed as in the REGRESSION procedure.


Intermediate Results 
At each iteration, the following statistics are calculated: 
Rho 


An updated value for $\rho$ is computed as

$$\hat{\rho} = \frac{\sum_{t=2}^{n} \hat{u}_t \hat{u}_{t-1}}{\sum_{t=2}^{n} \hat{u}_{t-1}^2}$$

where the residuals $\hat{u}_t$ are obtained from equation (1).


Standard Error of rho 


An estimate of the standard error of $\hat{\rho}$


1-/, 
iba 1-9 


where $p^* = p + 1$ if there is a constant term; p otherwise.


Durbin-Watson Statistic 


= (G41 —&)° 


yy = 
Mean Square Error

An estimate of the variance of $\varepsilon_t$
n 


ye (uy = piie_1)” 


Mora 
m—-2— p* 


Final Results 


Iteration terminates if either all the parameters change by less than a specified value (default 
0.001) or the number of iterations exceeds the cutoff value (default 10). 


The following variables are computed for each case: 


FIT 


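Fitted responses are computed as

$$\hat{y}_1 = \tilde{y}_1$$
and
$$\hat{y}_t = \tilde{y}_t + \hat{\rho} \hat{u}_{t-1}, \quad t = 2, \ldots, n$$

in which $\hat{\rho}$ is the final estimate of $\rho$ and

$$\tilde{y}_t = \hat{a} + \sum_{i=1}^{p} \hat{b}_i x_{ti}$$
$$\hat{u}_t = y_t - \tilde{y}_t, \quad t = 1, \ldots, n$$

ERR

Residuals are computed as

$$\hat{\varepsilon}_t = y_t - \hat{y}_t, \quad t = 2, \ldots, n$$

$$\hat{\varepsilon}_1 = \sqrt{1 - \hat{\rho}^2} \left( y_1 - \hat{y}_1 \right)$$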


SEP 


Standard error of predicted values at time t 
1 ZS 
SEP, = Vasey] ( +h , 


1—/? 
and 
SER = VMSEy/ (1+ hi) hea RR | 
where

$$h_i = X_i' \left( X^{*\prime} X^{*} \right)^{-1} X_i$$

in which $X_i$ is the predictor vector at time i with the first component 1 if a constant term is
included in equation (2*). $X^*$ is an $(n - 1) \times p^*$ design matrix for equation (2*). The first column
has a value of $1 - \hat{\rho}$ if a constant term is included in equation (2*).


LCL and UCL 


95% prediction interval for the future $\hat{y}_k$ is

$$\hat{y}_k \pm t_{n-1-p^*;\,0.025}\, SEP_k$$


Other Statistics 


Other statistics such as Multiple R, R-Squared, Adjusted R-Squared, and so on, are computed. 
Consult the REGRESSION procedure for details. 


Prais-Winsten Method 


This method is a modification of the Cochrane-Orcutt method in that the first case gets explicit
treatment. By adding an extra equation to (2*), the model has the form of

$$\begin{aligned}
(1 - \rho) y_1 &= a(1 - \rho) + \sum_{i=1}^{p} b_i (1 - \rho) x_{1i} + (1 - \rho) u_1 \\
y_t^* &= a(1 - \rho) + \sum_{i=1}^{p} b_i x_{ti}^* + \varepsilon_t, \quad t = 2, \ldots, n
\end{aligned} \qquad (4)$$

Like the Cochrane-Orcutt method, an initial value of $\rho$ can be set by the user or a default value of
zero can be used. The iterative process of estimating the parameters is performed via weighted
least squares (WLS). The weights used in WLS computation are $w_1 = (1 - \rho^2)/(1 - \rho)^2$ and
$w_t = 1$ for t = 2, ..., n. The computation of the variance of $\varepsilon_t$ and the variance of $\hat{\rho}$
is the same as that of the WLS in the REGRESSION procedure.
Initial Results


The WLS method is used to obtain initial parameter estimates. 


ANOVA 


The degrees of freedom are p for regression and n — p* for residuals. 


Intermediate Results 


The formulas for RHO, SE Rho, DW, and MSE are exactly the same as those in the
Cochrane-Orcutt method. The degrees of freedom for residuals, however, are $n - 1 - p^*$.


Final Results 


The following variables are computed for each case. 


SEP 


Standard error of predicted value at time t is computed as 


1 oe 
=p 
SEP, =VMSE (1 | hv) £=2,...,0 


where h is computed as

$$h_i = X_i' \left( X^{*\prime} X^{*} \right)^{-1} X_i$$

in which $X_i$ is the predictor vector at time i and $X^*$ is an $n \times p^*$ design matrix for equation (4). If a
constant term is included in the model, the first column of $X^*$ has a constant value of
$1 - \hat{\rho}$, the first row of $X^*$ is $\sqrt{w_1}(x_{11}, \ldots, x_{1p})$, and $p^* = p + 1$.


LCL and UCL 


95% prediction interval for $y_k$ at time k is

$$\hat{y}_k \pm t_{n-p^*;\,0.025}\, SEP_k$$


ARIMA Algorithms 


The ARIMA procedure computes the parameter estimates for a given seasonal or non-seasonal 
univariate ARIMA model. It also computes the fitted values, forecasting values, and other related 
variables for the model. 


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


$y_t$ (t = 1, 2, ..., N)    Univariate time series under investigation.
$N$                         Total number of observations.
$a_t$ (t = 1, 2, ..., N)    White noise series normally distributed with mean zero and variance $\sigma_a^2$.
$p$                         Order of the non-seasonal autoregressive part of the model
$q$                         Order of the non-seasonal moving average part of the model
$d$                         Order of the non-seasonal differencing
$P$                         Order of the seasonal autoregressive part of the model
$Q$                         Order of the seasonal moving-average part of the model
$D$                         Order of the seasonal differencing
$s$                         Seasonality or period of the model
$\varphi_p(B)$              AR polynomial of B of order p, $\varphi_p(B) = 1 - \varphi_1 B - \varphi_2 B^2 - \ldots - \varphi_p B^p$
$\theta_q(B)$               MA polynomial of B of order q, $\theta_q(B) = 1 - \theta_1 B - \theta_2 B^2 - \ldots - \theta_q B^q$
$\Phi_P(B^s)$               Seasonal AR polynomial of $B^s$ of order P, $\Phi_P(B^s) = 1 - \Phi_1 B^s - \Phi_2 B^{2s} - \ldots - \Phi_P B^{Ps}$
$\Theta_Q(B^s)$             Seasonal MA polynomial of $B^s$ of order Q, $\Theta_Q(B^s) = 1 - \Theta_1 B^s - \Theta_2 B^{2s} - \ldots - \Theta_Q B^{Qs}$
$\Delta$                    Differencing operator $\Delta = (1 - B)^d (1 - B^s)^D$
$B$                         Backward shift operator with $B y_t = y_{t-1}$ and $B a_t = a_{t-1}$


Models 
A seasonal univariate ARIMA(p,d,q)(P,D,Q)$_s$ model is given by
$$\Phi(B) \left[ \Delta y_t - \mu \right] = \Theta(B) a_t, \quad t = 1, \ldots, N$$

where

$$\Phi(B) = \varphi_p(B) \Phi_P(B^s), \qquad \Theta(B) = \theta_q(B) \Theta_Q(B^s)$$

and $\mu$ is an optional model constant. It is also called the stationary series mean, assuming that, after
differencing, the series is stationary. When NOCONSTANT is specified, $\mu$ is assumed to be zero.


An optional log scale transformation can be applied to y; before the model is fitted. In this chapter, 
the same symbol, y;, is used to denote the series either before or after log scale transformation. 
Independent variables $x_1, x_2, \ldots, x_m$ can also be included in the model. The model with
independent variables is given by

$$\Phi(B) \left[ \Delta \left( y_t - \sum_{i=1}^{m} c_i x_{it} \right) - \mu \right] = \Theta(B) a_t$$

where

$c_i, i = 1, 2, \ldots, m$, are the regression coefficients for the independent variables.
Estimation 


Basically, two different estimation algorithms are used to compute maximum likelihood (ML) 
estimates for the parameters in an ARIMA model: 


■ Melard's algorithm is used for the estimation when there is no missing data in the time
series. The algorithm computes the maximum likelihood estimates of the model parameters.
The details of the algorithm are described in (Melard, 1984), (Pearlman, 1980), and (Morf,
Sidhu, and Kailath, 1974).

■ A Kalman filtering algorithm is used for the estimation when some observations in the time
series are missing. The algorithm efficiently computes the marginal likelihood of an ARIMA
model with missing observations. The details of the algorithm are described in the following
literature: (Kohn and Ansley, 1986) and (Kohn and Ansley, 1985).


Initialization of ARMA parameters 

The ARMA parameters are initialized as follows: 

Assume that the series $Y_t$ follows an ARMA(p,q)(P,Q) model with mean 0; that is:
$$Y_t - \varphi_1 Y_{t-1} - \cdots - \varphi_p Y_{t-p} = a_t - \theta_1 a_{t-1} - \cdots - \theta_q a_{t-q}$$

In the following, $c_l$ and $\rho_l$ represent the lth lag autocovariance and autocorrelation of
$Y_t$ respectively, and $\hat{c}_l$ and $\hat{\rho}_l$ represent their estimates.


Non-seasonal AR parameters 


For AR parameter initial values, the estimation method is the same as that in Appendix A6.2 of


(Box, Jenkins, and Reinsel, 1994). Denote the estimates as ),-+- 7, 4- 


Non-seasonal MA parameters 
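Let
$$w_t = Y_t - \varphi_1 Y_{t-1} - \cdots - \varphi_p Y_{t-p} = a_t - \theta_1 a_{t-1} - \cdots - \theta_q a_{t-q}$$

The cross covariance

$$\lambda_l = E(w_{t+l} a_t) = E\left( (a_{t+l} - \theta_1 a_{t+l-1} - \cdots - \theta_q a_{t+l-q})\, a_t \right) =
\begin{cases}
\sigma_a^2 & l = 0 \\
-\theta_1 \sigma_a^2 & l = 1 \\
\quad \vdots & \\
-\theta_q \sigma_a^2 & l = q \\
0 & l > q
\end{cases}$$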


Assuming that an AR(p+q) can approximate $Y_t$, it follows that:


- , 2, , , , , a 
Yi — 9 1Yt-1 — ++ — 8 pYi—p — py Yt-p-1 — *°* — Y pq ¥t-p—q = 


, 


The AR parameters of this model are estimated as above and are denoted as (3,,--- ,2 eee 
Thus \; can be estimated by 


r - - , Whe! ’ Ss 
has Bie — prVistan ~~ ¥en (Yo Ys = = ptq¥i-p-a)) 
P+4 p pt+q 


=| Pl —S> gjpi4j - Yo Pim Se pts CO 
j=l t=1 j=1 


And the error variance $\sigma_a^2$ is approximated by


p+q p+aptd P+aP+4q 


~9 - ! = , ! not 
a =Var(-Se Ns | =D vie ei =o YW Pies 
j=0 i=0 j=0 i=0 j=0 
with Yo = 1 
Then the initial MA parameters are approximated by 6; = —,/o? and estimated by 
pt+a Pp pt 
o- >- Ojpisj - Sane >» S PiPjPl+j-i 
= Re PAD j=l i=1 j=1 
a = —Az/ Oa = ee 
s ye PiP jPi-j 
i=0 j=0 


So @; can be calculated by Gj, Gi, and { ean . In this procedure, only {/,})"" |’ are used and all 
other parameters are set to 0. 


Seasonal parameters 


For seasonal AR and MA components, the autocorrelations at the seasonal lags in the above 
equations are used. 
Diagnostic Statistics

The following definitions are used in the statistics below:

$N_p$               Number of parameters.
                    $N_p = p + q + P + Q + m$ without model constant
                    $N_p = p + q + P + Q + m + 1$ with model constant
SSQ                 Residual sum of squares. SSQ = e'e, where e is the residual vector
$\hat{\sigma}_a^2$  Estimated residual variance. $\hat{\sigma}_a^2 = SSQ / df$, where $df = N - N_p$
SSQ'                Adjusted residual sum of squares. $SSQ' = (SSQ)\,|\Omega|^{1/N}$, where $\Omega$ is the
                    theoretical covariance matrix of the observation vector computed at MLE


Log-Likelihood

$$L = -N \ln(\hat{\sigma}_a) - \frac{SSQ'}{2 \hat{\sigma}_a^2} - \frac{N \ln(2\pi)}{2}$$

Akaike Information Criterion (AIC)

$$AIC = -2L + 2N_p$$

Schwarz Bayesian Criterion (SBC)

$$SBC = -2L + \ln(N) N_p$$
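
As a small worked illustration, the two criteria follow directly from the log-likelihood L, the number of parameters $N_p$, and the number of observations N. The function name `arima_information_criteria` and its argument names are hypothetical.

```python
import math


def arima_information_criteria(log_likelihood, n_params, n_obs):
    """Return (AIC, SBC) with AIC = -2L + 2*Np and SBC = -2L + ln(N)*Np."""
    aic = -2.0 * log_likelihood + 2.0 * n_params
    sbc = -2.0 * log_likelihood + math.log(n_obs) * n_params
    return aic, sbc
```

Because ln(N) exceeds 2 once N is 8 or more, SBC penalizes each additional parameter more heavily than AIC for all but very short series.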


Generated Variables 


The following variables are generated for each case. 


Predicted Values 


Computation of predicted values depends upon the forecasting method. 


Forecasting Method: Conditional Least Squares (CLS or AUTOINIT)


In general, the model used for fitting and forecasting (after estimation, if involved) can be 


written as 

y —D(B)y = &(B)u + O(B)a; 4 So ¢(@(B) Aris 
t=1 

where 


D(B)=O(B)A-I1 
Thus, the predicted values $(FIT)_t$ are computed as follows:


(FIT), =i =D(B)in + ®(B) w+ O(B)a + S- ((B) Axi 
i=l 


where 
a=u-m Astsn 


Starting Values for Computing Fitted Series. To start the computation for fitted values, all 
unavailable beginning residuals are set to zero and unavailable beginning values of the fitted 
series are set according to the selected method: 


CLS. The computation starts at the (d+sD)-th period. After a specified log scale
transformation, if any, the original series is differenced and/or seasonally differenced
according to the model specification. Fitted values for the differenced series are computed first.
All unavailable beginning fitted values in the computation are replaced by the stationary series
mean, which is equal to the model constant in the model specification. The fitted values are then
aggregated to the original series and properly transformed back to the original scale. The first
d+sD fitted values are set to missing (SYSMIS).


AUTOINIT. The computation starts at the [d+p+s(D+P)]-th period. After any specified log scale 
transformation, the actual d+p+s(D+P) beginning observations in the series are used as beginning 
fitted values in the computation. The first d+p+s(D+P) fitted values are set to missing. The fitted 
values are then transformed back to the original scale, if a log transformation is specified. 


Forecasting Method: Unconditional Least Squares (EXACT) 


As with the CLS method, the computations start at the (d+sD)-th period. First, the original series 
(or the log-transformed series if a transformation is specified) is differenced and/or seasonally 
differenced according to the model specification. Then the fitted values for the differenced series 
are computed. The fitted values are one-step-ahead, least-squares predictors calculated using the 
theoretical autocorrelation function of the stationary autoregressive moving average (ARMA) 
process corresponding to the differenced series. The autocorrelation function is computed by 
treating the estimated parameters as the true parameters. The fitted values are then aggregated 
to the original series and properly transformed back to the original scale. The first d+sD fitted 
values are set to missing (SYSMIS). The details of the least-squares prediction algorithm for the 
ARMA models can be found in (Brockwell and Davis, 1991). 


Residuals 


Residual series are always computed in the transformed log scale, if a transformation is specified. 


$$(ERR)_t = y_t - (FIT)_t, \quad t = 1, 2, \ldots, N$$
Standard Errors of the Predicted Values


Standard errors of the predicted values are first computed in the transformed log scale, if a 
transformation is specified. 


Forecasting Method: Conditional Least Squares (CLS or AUTOINIT) 


$$(SEP)_t = \hat{\sigma}_a, \quad t = 1, 2, \ldots, N$$


Forecasting Method: Unconditional Least Squares (EXACT) 


In the EXACT method, unlike the CLS method, there is no simple expression for the standard 
errors of the predicted values. The standard errors of the predicted values will, however, be given 
by the least-squares prediction algorithm as a byproduct. 

Standard errors of the predicted values are then transformed back to the original scale for each 
predicted value, if a transformation is specified. 


Confidence Limits of the Predicted Values 


Confidence limits of the predicted values are first computed in the transformed log scale, if a 
transformation is specified: 


$$(LCL)_t = (FIT)_t - t_{1-\alpha/2,\,df}\,(SEP)_t, \quad t = 1, 2, \ldots, N$$
$$(UCL)_t = (FIT)_t + t_{1-\alpha/2,\,df}\,(SEP)_t, \quad t = 1, 2, \ldots, N$$

where $t_{1-\alpha/2,\,df}$ is the $(1 - \alpha/2)$-th percentile of a t distribution with df degrees of freedom
and $\alpha$ is the specified confidence level (by default $\alpha = 0.05$).
Confidence limits of the predicted values are then transformed back to the original scale for 
each predicted value, if a transformation is specified. 


Forecasting 


The following values are computed for each forecast period. 


Forecasting Values 


Computation of forecasting values depends upon the forecasting method. 


Forecasting Method: Conditional Least Squares (CLS or AUTOINIT)


$\hat{y}_t(l)$, the l-step-ahead forecast of $y_{t+l}$ at the time t, can be represented as:


it (1) = D(B) Gyr + B(B)u+O(B) ary + D> 4B (B) Aaj ry 
i=1 


Note that 
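$$\hat{y}_{t+l-i} = \begin{cases} y_{t+l-i} & \text{if } l \le i \\ \hat{y}_t(l - i) & \text{if } l > i \end{cases}$$
$$a_{t+l-i} = \begin{cases} y_{t+l-i} - \hat{y}_{t+l-i-1}(1) & \text{if } l \le i \\ 0 & \text{if } l > i \end{cases}$$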


Forecasting Method: Unconditional Least Squares (EXACT) 


The forecasts with this option are finite memory, least-squares forecasts computed using the 
theoretical autocorrelation function of the series. The details of the least-squares forecasting 
algorithm for the ARIMA models can be found in (Brockwell et al., 1991). 

Standard Errors of the Forecasting Values 


Computation of these standard errors depends upon the forecasting method. 


Forecasting Method: Conditional Least Squares (CLS or AUTOINIT)


For the purpose of computing standard errors of the forecasting values, the model can be written 
in the format of weights (ignoring the model constant): 


$$y_t = \frac{\theta(B)\,\Theta(B)}{\varphi(B)\,\Phi(B)\,\Delta}\, a_t = \psi(B)\, a_t = \sum_{i=0}^{\infty} \psi_i B^i a_t$$
where
$$\psi_0 = 1$$
Then
$$se\left[ \hat{y}_t(l) \right] = \left\{ 1 + \psi_1^2 + \psi_2^2 + \cdots + \psi_{l-1}^2 \right\}^{1/2} \hat{\sigma}_a$$

Note that, for the predicted value, l = 1. Hence, $(SEP)_t = \hat{\sigma}_a$ at any time t.

Computation of $\psi$ Weights. $\psi$ weights can be computed by expanding both sides of the following
equation and solving the linear equation system established by equating the corresponding
coefficients on both sides of the expansion:

$$\varphi_p(B)\, \Phi_P(B^s)\, \Delta\, \psi(B) = \theta_q(B)\, \Theta_Q(B^s)$$

An explicit expression of $\psi$ weights can be found in (Box et al., 1994).
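
Equating coefficients in the relation between the combined AR-and-differencing polynomial, the $\psi$ weights, and the combined MA polynomial leads to a simple recursion. The Python sketch below illustrates it under stated assumptions: polynomials are passed as coefficient lists in increasing powers of B with leading coefficient 1 (for example `[1.0, -0.5]` for $1 - 0.5B$), and the helper names `poly_mul`, `psi_weights`, and `forecast_se` are hypothetical.

```python
import math


def poly_mul(a, b):
    """Multiply two polynomials in B given as coefficient lists (lowest power first)."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def psi_weights(ar_poly, ma_poly, n):
    """First n psi weights solving ar_poly(B) * psi(B) = ma_poly(B).

    ar_poly should already include the differencing factors; ar_poly[0] must be 1.
    Matching the coefficient of B^j gives
        psi_j = ma_j - sum_{i>=1} ar_i * psi_{j-i}.
    """
    psi = []
    for j in range(n):
        ma_j = ma_poly[j] if j < len(ma_poly) else 0.0
        acc = sum(ar_poly[i] * psi[j - i]
                  for i in range(1, min(j, len(ar_poly) - 1) + 1))
        psi.append(ma_j - acc)
    return psi


def forecast_se(psi, sigma_a, lead):
    """Standard error of the lead-step-ahead forecast: sigma_a * sqrt(sum of psi_j^2, j < lead)."""
    return sigma_a * math.sqrt(sum(p * p for p in psi[:lead]))
```

For example, a model with $\varphi_1 = 0.5$ and d = 1 would use `poly_mul([1.0, -0.5], [1.0, -1.0])` as `ar_poly`; seasonal factors are folded in the same way by multiplying in polynomials whose nonzero coefficients sit at multiples of s.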


Forecasting Method: Unconditional Least Squares (EXACT) 


As with the standard errors of the predicted values, the standard errors of the forecasting values 
are a byproduct during the least-squares forecasting computation. The details can be found in 
(Brockwell et al., 1991). 
Automated Data Preparation Algorithms


The goal of automated data preparation is to prepare a dataset so as to generally improve the 
training speed, predictive power, and robustness of models fit to the prepared data. 


These algorithms do not assume which models will be trained post-data preparation. At the end 
of automated data preparation, we output the predictive power of each recommended predictor, 
which is computed from a linear regression or naive Bayes model, depending upon whether the 
target is continuous or categorical. 


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


$X$          A continuous or categorical variable
$x_i$        Value of the variable X for case i.
$f_i$        Frequency weight for case i. Non-integer positive values are rounded to the nearest
             integer. If there is no frequency weight variable, then all $f_i = 1$. If the frequency
             weight of a case is zero, negative or missing, then this case will be ignored.
$w_i$        Analysis weight for case i. If there is no analysis weight variable, then all $w_i = 1$. If
             the analysis weight of a case is zero, negative or missing, then this case will be ignored.
$n$          Number of cases in the dataset
$N_X$        $\sum_{i=1}^{n} f_i I(x_i \text{ is not missing})$, where I(expression) is the indicator function taking
             value 1 when the expression is true, 0 otherwise.
$W_X$        $\sum_{i=1}^{n} f_i w_i I(x_i \text{ is not missing})$
$N_{XY}$     $\sum_{i=1}^{n} f_i I(x_i \text{ and } y_i \text{ are not missing})$
$W_{XY}$     $\sum_{i=1}^{n} f_i w_i I(x_i \text{ and } y_i \text{ are not missing})$
$\bar{x}$    The mean of variable X, $\frac{1}{W_X} \sum_{i=1}^{n} f_i w_i x_i I(x_i \text{ is not missing})$
n 
Se fiwi(ai — 7)” 
i=1 


n 
ae ae \ 7 . . ° 
Wyy = fiwiriT (x; and y: are not missing) 
i=1 


n 


be fiwi (vi — Ty) (yi — F,) 


i=l 


A note on missing values 


Listwise deletion is used in the following sections:

■ "Univariate Statistics Collection"
■ "Basic Variable Screening"
■ "Measurement Level Recasting"
■ "Missing Value Handling"
■ "Outlier Identification and Handling"
■ "Continuous Predictor Transformations"
■ "Target Handling"
■ "Reordering Categories"
■ "Unsupervised Merge"

Pairwise deletion is used in the following sections:

■ "Bivariate Statistics Collection"
■ "Supervised Merge"
■ "Supervised Binning"
■ "Feature Selection and Construction"
■ "Predictive Power"


A note on frequency weight and analysis weight 


The frequency weight variable is treated as a case replication weight. For example, if a case has
a frequency weight of 2, then this case will count as 2 cases.

The analysis weight adjusts the variance of cases. For example, if a case $x_i$ of a variable X
has an analysis weight $w_i$, then we assume that $x_i \sim N(\mu, \sigma^2 / w_i)$.

Frequency weights and analysis weights are used in automated preparation of other variables, but 
are themselves left unchanged in the dataset. 


Date/Time Handling 


Date Handling 


If there is a date variable, we extract the date elements (year, month and day) as ordinal variables. 
If requested, we also calculate the number of elapsed days/months/years since the user-specified 
reference date (default is the current date). Unless specified by the user, the “best” unit of duration 
is chosen as follows: 


1. If the minimum number of elapsed days is less than 31, then we use days as the best unit.

2. If the minimum number of elapsed days is less than 366 but larger than or equal to 31, we use
months as the best unit. The number of months between two dates is calculated based on the average
number of days in a month (30.4375): months = days / 30.4375.

3. If the minimum number of elapsed days is larger than or equal to 366, we use years as the best
unit. The number of years between two dates is calculated based on the average number of days in a
year (365.25): years = days / 365.25.
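
The three rules above can be sketched as a small Python function. The name `elapsed_duration` and its arguments are hypothetical; `min_elapsed_days` stands for the minimum number of elapsed days over all cases, which selects the unit, and `elapsed_days` is the value for one case to be converted.

```python
def elapsed_duration(min_elapsed_days, elapsed_days):
    """Pick the "best" duration unit from the minimum elapsed days and convert one value."""
    if min_elapsed_days < 31:
        return "days", elapsed_days
    if min_elapsed_days < 366:
        return "months", elapsed_days / 30.4375   # average days per month
    return "years", elapsed_days / 365.25         # average days per year
```

The unit is chosen once per variable from the minimum, then applied to every case, so all durations for a given date variable share the same unit.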
Once the date elements are extracted and the duration is obtained, the original date variable
is excluded from the rest of the analysis.


Time Handling 


If there is a time variable, we extract the time elements (second, minute and hour) as ordinal 
variables. If requested, we also calculate the number of elapsed seconds/minutes/hours since 
the user-specified reference time (default is the current time). Unless specified by the user, the 
“best” unit of duration is chosen as follows: 


1. If the minimum number of elapsed seconds is less than 60, then we use seconds as the best unit.


2. If the minimum number of elapsed seconds is larger than or equal to 60 but less than 3600, we
use minutes as the best unit.


3. If the minimum number of elapsed seconds is larger than or equal to 3600, we use hours as the
best unit.


Once the time elements are extracted and the duration is obtained, the original time variable
is excluded from the rest of the analysis.
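The unit-selection rules for dates and times can be sketched in a few lines. This is a minimal illustration, not the product's implementation; the function names `best_date_unit`, `best_time_unit` and `convert_days` are hypothetical, and elapsed values are assumed to be non-negative numbers.

```python
DAYS_PER_MONTH = 30.4375  # average month length quoted in the text
DAYS_PER_YEAR = 365.25    # average year length quoted in the text


def best_date_unit(elapsed_days):
    """Choose the duration unit from the minimum number of elapsed days."""
    shortest = min(elapsed_days)
    if shortest < 31:
        return "days"
    if shortest < 366:
        return "months"
    return "years"


def best_time_unit(elapsed_seconds):
    """Choose the duration unit from the minimum number of elapsed seconds."""
    shortest = min(elapsed_seconds)
    if shortest < 60:
        return "seconds"
    if shortest < 3600:
        return "minutes"
    return "hours"


def convert_days(days, unit):
    """Convert a day count to the chosen unit using the average lengths."""
    if unit == "months":
        return days / DAYS_PER_MONTH
    if unit == "years":
        return days / DAYS_PER_YEAR
    return days
```

Note that the unit depends only on the minimum elapsed value, so a single recent date is enough to select the finer unit for the whole variable.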


Univariate Statistics Collection 


Continuous Variables 


For each continuous variable, we calculate the following statistics: 


■ Number of missing values: $N_X^{missing} = \sum_{i=1}^{n} f_i\, I(x_i \text{ is missing})$

■ Number of valid values: $N_X$

■ Minimum value: $\min_i x_i$

■ Maximum value: $\max_i x_i$

■ Mean, standard deviation, skewness (see below).

■ The number of distinct values $I$.

■ The number of cases for each distinct value $s_i$: $c_i = \sum_{j=1}^{n} f_j\, I(x_j = s_i)$

■ Median: If the distinct values of X are sorted in ascending order, $s_1 < s_2 < \cdots < s_I$, then the
median can be computed by $\mathrm{Median}(X) = \min\{ s_i : cc_i / N_X \ge 0.5 \}$, where $cc_i = \sum_{j=1}^{i} c_j$.


Note: If the number of distinct values is larger than a threshold (default is 5), we stop updating 
the number of distinct values and the number of cases for each distinct value. Also we do not 
calculate the median. 
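As an illustration of the median rule, the sketch below accumulates the counts of the sorted distinct values and returns the first value whose cumulative share reaches 0.5. The function name `median_from_counts` is hypothetical, missing values are assumed to have been removed already, and the distinct-value threshold is not modeled.

```python
def median_from_counts(values, freqs):
    """Median as the smallest distinct value s_i with cc_i / N_X >= 0.5.

    values: non-missing observations; freqs: their frequency weights.
    """
    counts = {}
    for x, f in zip(values, freqs):
        counts[x] = counts.get(x, 0) + f      # c_i for each distinct value
    n_valid = sum(counts.values())            # N_X
    cumulative = 0
    for s in sorted(counts):                  # s_1 < s_2 < ... < s_I
        cumulative += counts[s]               # cc_i
        if cumulative / n_valid >= 0.5:
            return s
    return None                               # no valid cases
```

With this rule the median is always one of the observed values; for an even number of equally weighted cases it is the lower of the two middle values.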


Categorical Numeric Variables 


For each categorical numeric variable, we calculate the following statistics: 


■ Number of missing values: $N_X^{missing} = \sum_{i=1}^{n} f_i\, I(x_i \text{ is missing})$

■ Number of valid values: $N_X$

■ Minimum value: $\min_i x_i$ (only for ordinal variables)

■ Maximum value: $\max_i x_i$ (only for ordinal variables)

■ The number of categories.

■ The counts of each category.

■ Mean, standard deviation, skewness (only for ordinal variables; see below).

■ Mode (only for nominal variables). If several values share the greatest frequency of
occurrence, then the mode with the smallest value is used.

■ Median (only for ordinal variables): If the distinct values of X are sorted in ascending order,
$s_1 < s_2 < \cdots < s_I$, then the median can be computed by $\mathrm{Median}(X) = \min\{ s_i : cc_i / N_X \ge 0.5 \}$,
where $cc_i = \sum_{j=1}^{i} c_j$.


Notes: 


1. If an ordinal predictor has more categories than a specified threshold (default 10), we stop 
updating the number of categories and the number of cases for each category. Also we do not 
calculate mode and median. 


2. If a nominal predictor has more categories than a specified threshold (default 100), we stop
collecting statistics and just store the information that the variable had more categories than
the threshold.


Categorical String Variables 


For each string variable, we calculate the following statistics: 


■ Number of missing values: $N_X^{missing} = \sum_{i=1}^{n} f_i\, I(x_i \text{ is missing})$

■ Number of valid values: $N_X$

■ The number of categories.

■ Counts of each category.

■ Mode: If several values share the greatest frequency of occurrence, then the mode with the
smallest value is used.


Note: If a string predictor has more categories than a specified threshold (default 100), we stop 
collecting statistics and just store the information that the predictor had more than threshold 
categories. 


Mean, Standard Deviation, Skewness 


We calculate mean, standard deviation and skewness by updating moments. 


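1. Start with $N_X^{(0)} = W_X^{(0)} = \bar{x}^{(0)} = M_X^{2(0)} = M_X^{3(0)} = 0$.


2. For j=1,..,n compute:

$$N_X^{(j)} = N_X^{(j-1)} + f_j\, I(x_j \text{ is not missing})$$

$$W_X^{(j)} = W_X^{(j-1)} + f_j w_j\, I(x_j \text{ is not missing})$$

$$v_j = \frac{f_j w_j}{W_X^{(j)}} \left( x_j - \bar{x}^{(j-1)} \right)$$

$$\bar{x}^{(j)} = \bar{x}^{(j-1)} + v_j$$

$$M_X^{2(j)} = M_X^{2(j-1)} + \frac{W_X^{(j)} W_X^{(j-1)}}{f_j w_j}\, v_j^2$$

$$M_X^{3(j)} = M_X^{3(j-1)} - 3 v_j M_X^{2(j-1)} + \frac{W_X^{(j)} W_X^{(j-1)}}{(f_j w_j)^2} \left( W_X^{(j)} - 2 f_j w_j \right) v_j^3$$


3. After the last case has been processed, compute:

Mean: $\bar{x} = \bar{x}^{(n)}$

Standard deviation: $sd = \sqrt{\dfrac{M_X^{2(n)}}{N_X - 1}}$

Skewness: $skew = \dfrac{N_X\, M_X^{3(n)}}{(N_X - 1)(N_X - 2)\, sd^3}$


If $N_X \le 2$ or $sd^2 < 10^{-20}$, then skewness is not calculated.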
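The moment-updating scheme can be sketched as a single pass over the data. This is an illustrative reading of the recurrences, not the product's code: the function name `running_moments` is hypothetical, missing values are represented by `None`, and cases whose combined weight is zero are skipped (an assumption made here to avoid division by zero).

```python
import math


def running_moments(xs, fs=None, ws=None):
    """One-pass weighted mean, standard deviation and skewness.

    xs: values (None = missing); fs: frequency weights; ws: analysis weights.
    Returns (N_X, mean, sd, skew); sd or skew is None when not computable.
    """
    n = len(xs)
    fs = fs or [1] * n
    ws = ws or [1.0] * n
    n_valid, w_sum, mean, m2, m3 = 0, 0.0, 0.0, 0.0, 0.0
    for x, f, w in zip(xs, fs, ws):
        fw = f * w
        if x is None or fw == 0:          # skip missing or zero-weight cases
            continue
        w_prev = w_sum
        n_valid += f
        w_sum += fw
        v = fw / w_sum * (x - mean)
        # The third moment is updated first because it needs the previous m2.
        m3 += -3.0 * v * m2 + w_sum * w_prev * (w_sum - 2.0 * fw) * v ** 3 / fw ** 2
        m2 += w_sum * w_prev * v * v / fw
        mean += v
    sd = math.sqrt(m2 / (n_valid - 1)) if n_valid > 1 else None
    skew = None
    if n_valid > 2 and sd is not None and sd * sd >= 1e-20:
        skew = n_valid * m3 / ((n_valid - 1) * (n_valid - 2) * sd ** 3)
    return n_valid, mean, sd, skew
```

Updating the centered moments case by case avoids the cancellation problems of the sum-of-squares shortcut, which is the motivation for this kind of algorithm.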


Basic Variable Screening 


1. If the percent of missing values is greater than a threshold (default is 50%), then exclude the 
variable from subsequent analysis. 


2. For continuous variables, if the maximum value is equal to minimum value, then exclude the 
variable from subsequent analysis. 


3. For categorical variables, if the mode contains more cases than a specified percentage (default 
is 95%), then exclude the variable from subsequent analysis. 


4. If a string variable has more categories than a specified threshold (default is 100), then exclude the
variable from subsequent analysis.


Checkpoint 1: Exit? 


This checkpoint determines whether the algorithm should be terminated. If, after the screening 
step: 


1. The target (if specified) has been removed from subsequent analysis, or 


2. All predictors have been removed from subsequent analysis, 


then terminate the algorithm and generate an error. 


Measurement Level Recasting 


For each continuous variable, if the number of distinct values is less than a threshold (default
is 5), then it is recast as an ordinal variable.

For each numeric ordinal variable, if the number of categories is greater than a threshold (default
is 10), then it is recast as a continuous variable.


Note: The continuous-to-ordinal threshold must be less than the ordinal-to-continuous threshold. 


Outlier Identification and Handling 


In this section, we identify outliers in continuous variables and then set the outlying values to a 
cutoff or to a missing value. The identification is based on the robust mean and robust standard 
deviation which are estimated by supposing that the percentage of outliers is no more than 5%. 


Identification 


1. Compute the mean and standard deviation from the raw data. Split the continuous variable into
non-intersecting intervals: $I_i = (\bar{x} + (i-1) \times sd_w,\ \bar{x} + i \times sd_w]$, $i = -3, -2, \ldots, 2, 3, 4$, where
$I_{-3} = (-\infty,\ \bar{x} - 3 sd_w]$, $I_4 = (\bar{x} + 3 sd_w,\ +\infty]$ and $sd_w = sd \times \sqrt{N_X / W_X}$.


2. Calculate univariate statistics in each interval:

$$N_{I_i} = \sum_{j=1}^{n} f_j\, I(x_j \in I_i), \qquad W_{I_i} = \sum_{j=1}^{n} f_j w_j\, I(x_j \in I_i)$$

$$\bar{x}_{I_i} = \frac{\sum_{j=1}^{n} f_j w_j x_j\, I(x_j \in I_i)}{W_{I_i}}, \qquad M^2_{I_i} = \sum_{j=1}^{n} f_j w_j (x_j - \bar{x}_{I_i})^2\, I(x_j \in I_i)$$


3. Let $l = -3$, $r = 4$, and $p = 0$.


4. Between the two tail intervals $I_l$ and $I_r$, find the interval with the least number of cases.


5. If $N_{I_l} < N_{I_r}$, then $p_{current} = N_{I_l} / N_X$. Check if $p + p_{current}$ is less than a threshold $p_{threshold}$ (default
is 0.05). If it is, then $p = p + p_{current}$ and $l = l + 1$, go to step 4; otherwise, go to step 6.

Else $p_{current} = N_{I_r} / N_X$. Check if $p + p_{current}$ is less than the threshold $p_{threshold}$. If it is, then
$p = p + p_{current}$ and $r = r - 1$, go to step 4; otherwise, go to step 6.


6. Compute the robust mean $\bar{x}_{robust}$ and robust standard deviation $sd_{robust}$ within the range
$(\bar{x} + (l-1) \times sd_w,\ \bar{x} + r \times sd_w]$. See below for details.


7. If $x_i$ satisfies the conditions:

$$\sqrt{w_i}\,(x_i - \bar{x}_{robust}) < -cutoff \times sd_{robust} \quad \text{or} \quad \sqrt{w_i}\,(x_i - \bar{x}_{robust}) > cutoff \times sd_{robust}$$

where cutoff is a positive number (default is 3), then $x_i$ is detected as an outlier.


Handling 


Outliers will be handled using one of the following methods:


■ Trim outliers to cutoff values. If $\sqrt{w_i}\,(x_i - \bar{x}_{robust}) < -cutoff \times sd_{robust}$, then replace $x_i$ by
$\bar{x}_{robust} - cutoff \times sd_{robust} / \sqrt{w_i}$, and if $\sqrt{w_i}\,(x_i - \bar{x}_{robust}) > cutoff \times sd_{robust}$, then replace
$x_i$ by $\bar{x}_{robust} + cutoff \times sd_{robust} / \sqrt{w_i}$.


■ Set outliers to missing values.
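The detection condition and the two handling options can be sketched for a single case as follows. The function name `handle_outlier` and the `trim` flag are hypothetical; the robust mean and standard deviation are assumed to have been computed already, and `None` stands for a missing value.

```python
import math


def handle_outlier(x, w, robust_mean, robust_sd, cutoff=3.0, trim=True):
    """Return x unchanged, trimmed to the cutoff value, or None (set to missing)."""
    z = math.sqrt(w) * (x - robust_mean)
    bound = cutoff * robust_sd
    if -bound <= z <= bound:
        return x                      # not an outlier
    if not trim:
        return None                   # outlier set to missing
    sign = 1.0 if z > 0 else -1.0
    return robust_mean + sign * bound / math.sqrt(w)
```

Because the analysis weight enters through $\sqrt{w_i}$, a heavily weighted case is trimmed to a value closer to the robust mean than a lightly weighted one.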


Update Univariate Statistics


After outlier handling, we perform a data pass to calculate univariate statistics for each continuous 
variable, including the number of missing values, minimum, maximum, mean, standard deviation, 
skewness, and number of outliers. 


Robust Mean and Standard Deviation 


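Robust mean and standard deviation within the range $(\bar{x} + (l-1) \times sd_w,\ \bar{x} + r \times sd_w]$ are calculated
as follows:

$$\bar{x}_{robust} = \frac{\sum_{i=l}^{r} W_{I_i}\, \bar{x}_{I_i}}{\sum_{i=l}^{r} W_{I_i}}$$

and

$$sd_{robust} = \sqrt{\frac{A^2_{robust}}{\sum_{i=l}^{r} N_{I_i} - 1}}$$

where $A^2_{robust} = \sum_{i=l}^{r} M^2_{I_i} + \sum_{i=l}^{r} W_{I_i} \left( \bar{x}_{robust} - \bar{x}_{I_i} \right)^2$.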


Missing Value Handling 


Continuous variables. Missing values are replaced by the mean, and the following statistics are
updated:


■ Standard deviation: $sd \times \sqrt{\dfrac{N_X - 1}{N - 1}}$, where $N = N_X + N_X^{missing}$

■ Skewness: $skew \times \dfrac{L_1}{L_2}$, where $L_1 = \left( \dfrac{N}{N_X} \right) \left( \dfrac{N_X - 2}{N - 2} \right)$ and $L_2 = \sqrt{\dfrac{N_X - 1}{N - 1}}$

■ The number of missing values: $N_X^{missing} = 0$

■ The number of valid values: $N_X = N$


Ordinal variables. Missing values are replaced by the median, and the following statistics are
updated:


■ The number of cases in the median category: $c_{median} + N_X^{missing}$, where $c_{median}$ is the
original number of cases in the median category.

■ The number of missing values: $N_X^{missing} = 0$

■ The number of valid values: $N_X = N$


Nominal variables. Missing values are replaced by the mode, and the following statistics are
updated:


■ The number of cases in the modal category: $c_{mode} + N_X^{missing}$, where $c_{mode}$ is the original
number of cases in the modal category.

■ The number of missing values: $N_X^{missing} = 0$

■ The number of valid values: $N_X = N$


Continuous Predictor Transformations


We transform a continuous predictor so that it has the user-specified mean $\bar{x}_{user}$ (default
0) and standard deviation $sd_{user}$ (default 1) using the z-score transformation, or the user-specified minimum
$min_{user}$ (default 0) and maximum $max_{user}$ (default 100) using the min-max transformation.


Z-score Transformation 


Suppose a continuous variable has mean $\bar{x}$ and standard deviation $sd$. The z-score transformation is

$$x_i' = \frac{sd_{user}}{sd} (x_i - \bar{x}) + \bar{x}_{user}$$

where $x_i'$ is the transformed value of continuous variable X for case i.


Since we do not take into account the analysis weight in the rescaling formula, the rescaled values
$x_i'$ follow a normal distribution $N\!\left( \bar{x}_{user},\ sd_{user}^2 / w_i \right)$.


Update univariate statistics


After a z-score transformation, the following univariate statistics are updated: 


■ Number of missing values: $N_{X'}^{missing} = N_X^{missing}$

■ Number of valid values: $N_{X'} = N_X$

■ Minimum value: $\min(x_i') = \dfrac{sd_{user}}{sd} (\min x_i - \bar{x}) + \bar{x}_{user}$

■ Maximum value: $\max(x_i') = \dfrac{sd_{user}}{sd} (\max x_i - \bar{x}) + \bar{x}_{user}$

■ Mean: $\bar{x}' = \bar{x}_{user}$

■ Standard deviation: $sd(x') = sd_{user}$

■ Skewness: $skew(x') = skew(x)$


Min-Max Transformation 


Suppose a continuous variable has a minimum value $\min x_i$ and a maximum value $\max x_i$. The
min-max transformation is

$$x_i' = \frac{max_{user} - min_{user}}{\max x_i - \min x_i} (x_i - \min x_i) + min_{user}$$

where $x_i'$ is the transformed value of continuous variable X for case i.


Update univariate statistics 


After a min-max transformation, the following univariate statistics are updated: 


■ The number of missing values: $N_{X'}^{missing} = N_X^{missing}$

■ The number of valid values: $N_{X'} = N_X$

■ Minimum value: $\min(x_i') = min_{user}$

■ Maximum value: $\max(x_i') = max_{user}$

■ Mean: $\bar{x}' = \dfrac{max_{user} - min_{user}}{\max x_i - \min x_i} (\bar{x} - \min x_i) + min_{user}$

■ Standard deviation: $sd(x') = \dfrac{max_{user} - min_{user}}{\max x_i - \min x_i} \times sd$

■ Skewness: $skew(x') = skew(x)$
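Both rescalings in this section are linear maps, which is why the skewness is unchanged. The sketch below shows them side by side; the function names are hypothetical, the mean and standard deviation for the z-score are passed in rather than recomputed, and a non-constant variable is assumed (constant variables are removed during basic screening).

```python
def z_score_rescale(xs, mean, sd, user_mean=0.0, user_sd=1.0):
    """Rescale to a user-specified mean and standard deviation."""
    return [user_sd / sd * (x - mean) + user_mean for x in xs]


def min_max_rescale(xs, user_min=0.0, user_max=100.0):
    """Rescale to a user-specified minimum and maximum."""
    lo, hi = min(xs), max(xs)
    scale = (user_max - user_min) / (hi - lo)   # assumes hi > lo
    return [scale * (x - lo) + user_min for x in xs]
```

For example, `min_max_rescale([2, 4, 6])` maps the smallest value to 0, the largest to 100, and the middle value to 50.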


Target Handling 


Nominal Target 


For a nominal target, we rearrange categories from lowest to highest counts. If there is a tie on 
counts, then ties will be broken by ascending sort or lexical order of the data values. 


Continuous Target 


The transformation proposed by Box and Cox (1964) transforms a continuous variable into one 
that is more normally distributed. We apply the Box-Cox transformation followed by the z score 
transformation so that the rescaled target has the user-specified mean and standard deviation. 


Box-Cox transformation. This transforms a non-normal variable Y to a more normally distributed
variable:

$$g_i(\lambda) = g(y_i, \lambda) = \begin{cases} \dfrac{(y_i - c)^{\lambda} - 1}{\lambda} & \lambda \ne 0 \\[4pt] \ln(y_i - c) & \lambda = 0 \end{cases}$$

where $y_i$, $i = 1, 2, \ldots, n$ are observations of variable Y, and c is a constant such that all values
$y_i - c$ are positive. Here, we choose $c = \min(Y) - 1$.


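The parameter $\lambda$ is selected to maximize the log-likelihood function:

$$L(\lambda) = -\frac{N_y}{2} \ln\!\left[ \frac{N_y - 1}{N_y} \left( sd(g(\lambda)) \right)^2 \right] + (\lambda - 1) \sum_{i=1}^{n} f_i \ln(y_i - c)$$

where $\left( sd(g(\lambda)) \right)^2 = \dfrac{1}{N_y - 1} \sum_{i=1}^{n} f_i w_i \left( g_i(\lambda) - \bar{g}(\lambda) \right)^2$ and $\bar{g}(\lambda) = \dfrac{1}{W_y} \sum_{i=1}^{n} f_i w_i\, g_i(\lambda)$.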


We perform a grid search over a user-specified finite set [a,b] with increment s. By default a=—3, 
b=3, and s=0.5. 


The algorithm can be described as follows: 


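1. Compute $\lambda_j = a + (j - 1) \times s$, where j is a positive integer such that $a \le \lambda_j \le b$.


2. For each $\lambda_j$, compute the following statistics:

Mean: $\bar{g}(\lambda_j) = \dfrac{1}{W_y} \sum_{i=1}^{n} f_i w_i\, g_i(\lambda_j)$

Standard deviation: $sd(g(\lambda_j)) = \sqrt{\dfrac{1}{N_y - 1} \sum_{i=1}^{n} f_i w_i \left( g_i(\lambda_j) - \bar{g}(\lambda_j) \right)^2}$

Skewness: $skew(g(\lambda_j)) = \dfrac{N_y \sum_{i=1}^{n} f_i w_i \left( g_i(\lambda_j) - \bar{g}(\lambda_j) \right)^3}{(N_y - 2)(N_y - 1)\, sd(g(\lambda_j))^3}$

Sum of logarithm transformation: $\sum_{i=1}^{n} f_i \ln(y_i - c)$


3. For each $\lambda_j$, compute the log-likelihood function $L(\lambda_j)$. Find the value of j with the largest
log-likelihood function, breaking ties by selecting the smallest value of $\lambda_j$. Also find the
corresponding statistics $\bar{g}(\lambda^*)$, $sd(g(\lambda^*))$ and $skew(g(\lambda^*))$.


4. Transform the target to reflect the user's mean $\bar{y}_{user}$ (default is 0) and standard deviation $sd_{user}$ (default
is 1):

$$g_i'(\lambda^*) = \frac{sd_{user}}{sd(g(\lambda^*))} \left( g_i(\lambda^*) - \bar{g}(\lambda^*) \right) + \bar{y}_{user}$$

where $\bar{g}(\lambda^*) = \dfrac{1}{W_y} \sum_{i=1}^{n} f_i w_i\, g_i(\lambda^*)$ and $sd(g(\lambda^*)) = \sqrt{\dfrac{1}{N_y - 1} \sum_{i=1}^{n} f_i w_i \left( g_i(\lambda^*) - \bar{g}(\lambda^*) \right)^2}$.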


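Update univariate statistics. After Box-Cox and z-score transformations, the following
univariate statistics are updated:

■ Minimum value: $\dfrac{sd_{user}}{sd(g(\lambda^*))} \left( g(\min(y_i) - c, \lambda^*) - \bar{g}(\lambda^*) \right) + \bar{y}_{user}$

■ Maximum value: $\dfrac{sd_{user}}{sd(g(\lambda^*))} \left( g(\max(y_i) - c, \lambda^*) - \bar{g}(\lambda^*) \right) + \bar{y}_{user}$

■ Mean: $\bar{y}_{user}$

■ Standard deviation: $sd_{user}$

■ Skewness: $skew(g(\lambda^*))$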
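The grid search for the Box-Cox parameter can be sketched as below. This is a minimal illustration under stated assumptions: the function names are hypothetical, the log-likelihood is taken in the profile form $-\frac{N_y}{2}\ln[\frac{N_y-1}{N_y} sd^2] + (\lambda - 1)\sum f_i \ln(y_i - c)$, and the target is assumed to be non-constant so that the variance is positive.

```python
import math


def box_cox(y, c, lam):
    """Box-Cox transform of a single value with shift c."""
    if lam == 0:
        return math.log(y - c)
    return ((y - c) ** lam - 1.0) / lam


def box_cox_grid_search(ys, fs=None, ws=None, a=-3.0, b=3.0, s=0.5):
    """Return (best lambda, its log-likelihood) over the grid a, a+s, ..., b."""
    n = len(ys)
    fs = fs or [1] * n
    ws = ws or [1.0] * n
    c = min(ys) - 1                               # makes every y - c >= 1
    n_y = sum(fs)
    w_y = sum(f * w for f, w in zip(fs, ws))
    log_sum = sum(f * math.log(y - c) for y, f in zip(ys, fs))
    best = None
    steps = int(math.floor((b - a) / s + 1e-9))
    for j in range(steps + 1):
        lam = a + j * s
        g = [box_cox(y, c, lam) for y in ys]
        mean = sum(f * w * gi for gi, f, w in zip(g, fs, ws)) / w_y
        var = sum(f * w * (gi - mean) ** 2 for gi, f, w in zip(g, fs, ws)) / (n_y - 1)
        loglik = -n_y / 2.0 * math.log((n_y - 1) / n_y * var) + (lam - 1.0) * log_sum
        # Strict comparison keeps the smallest lambda when there is a tie.
        if best is None or loglik > best[1]:
            best = (lam, loglik)
    return best
```

For a strongly right-skewed target the search tends to select a value of $\lambda$ below 1, which compresses the long upper tail.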


Bivariate Statistics Collection 


For each target/predictor pair, the following statistics are collected according to the measurement 
levels of the target and predictor. 


Continuous target or no target and all continuous predictors 


If there is a continuous target and some continuous predictors, then we need to calculate the 
covariance and correlations between all pairs of continuous variables. If there is no continuous 
target, then we only calculate the covariance and correlations between all pairs of continuous 
predictors. We suppose there are m continuous variables, and denote the covariance
matrix as $C_{m \times m}$, with element $c_{ij}$, and the correlation matrix as $R_{m \times m}$, with element $r_{ij}$.


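We define the covariance between two continuous variables X and Y as

$$cov(X, Y) = \frac{1}{N_{XY} - 1} \sum_{i=1}^{n} f_i w_i (x_i - \bar{x}_Y)(y_i - \bar{y}_X)$$

where $\bar{x}_Y = \dfrac{1}{W_{XY}} \sum_{i=1}^{n} f_i w_i x_i\, I(x_i \text{ and } y_i \text{ are not missing})$ and
$\bar{y}_X = \dfrac{1}{W_{XY}} \sum_{i=1}^{n} f_i w_i y_i\, I(x_i \text{ and } y_i \text{ are not missing})$.


The covariance can be computed by a provisional means algorithm:


Start with $N_{XY}^{(0)} = W_{XY}^{(0)} = \bar{x}_Y^{(0)} = \bar{y}_X^{(0)} = M_{XY}^{(0)} = 0$.


For j=1,..,n compute:

$$N_{XY}^{(j)} = N_{XY}^{(j-1)} + f_j\, I(x_j \text{ and } y_j \text{ are not missing})$$

$$W_{XY}^{(j)} = W_{XY}^{(j-1)} + f_j w_j\, I(x_j \text{ and } y_j \text{ are not missing})$$

$$\bar{x}_Y^{(j)} = \bar{x}_Y^{(j-1)} + \frac{f_j w_j}{W_{XY}^{(j)}} \left( x_j - \bar{x}_Y^{(j-1)} \right)$$

$$\bar{y}_X^{(j)} = \bar{y}_X^{(j-1)} + \frac{f_j w_j}{W_{XY}^{(j)}} \left( y_j - \bar{y}_X^{(j-1)} \right)$$

$$M_{XY}^{(j)} = M_{XY}^{(j-1)} + \left( x_j - \bar{x}_Y^{(j-1)} \right) \left( y_j - \bar{y}_X^{(j-1)} \right) \left( f_j w_j - \frac{(f_j w_j)^2}{W_{XY}^{(j)}} \right)$$


After the last case has been processed, we obtain:

$$M_{XY} = M_{XY}^{(n)} = \sum_{i=1}^{n} f_i w_i (x_i - \bar{x}_Y)(y_i - \bar{y}_X)$$


Compute bivariate statistics between X and Y:


Number of valid cases: $N_{XY}$


Covariance: $c_{XY} = \dfrac{M_{XY}}{N_{XY} - 1}$


Correlation: $r_{XY} = \dfrac{c_{XY}}{\sqrt{c_{XX}}\, \sqrt{c_{YY}}}$


Note: If there are no valid cases when pairwise deletion is used, then we let $c_{XY} = 0$ and $r_{XY} = 0$.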
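The provisional means update for the covariance can be sketched as a single pass with pairwise deletion. The function name `pairwise_covariance` is hypothetical, `None` marks a missing value, and returning 0 when fewer than two valid cases remain extends the no-valid-cases convention in the note (an assumption made here to avoid dividing by zero).

```python
def pairwise_covariance(xs, ys, fs=None, ws=None):
    """Return (N_XY, c_XY) using cases where both x and y are present."""
    n = len(xs)
    fs = fs or [1] * n
    ws = ws or [1.0] * n
    n_xy, w_xy, mean_x, mean_y, m_xy = 0, 0.0, 0.0, 0.0, 0.0
    for x, y, f, w in zip(xs, ys, fs, ws):
        fw = f * w
        if x is None or y is None or fw == 0:   # pairwise deletion
            continue
        n_xy += f
        w_xy += fw
        dx, dy = x - mean_x, y - mean_y          # deviations from previous means
        m_xy += dx * dy * (fw - fw * fw / w_xy)
        mean_x += fw / w_xy * dx
        mean_y += fw / w_xy * dy
    if n_xy < 2:
        return n_xy, 0.0
    return n_xy, m_xy / (n_xy - 1)
```

The cross-product sum is updated before the means because the recurrence uses the means from the previous step.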


Categorical target and all continuous predictors 


For a categorical target Y with values $i = 1, 2, \ldots, J$ and a continuous predictor X with values
$x_1, \ldots, x_n$, the bivariate statistics are:


Mean of X for each Y=i, i=1,...,J:

$$\bar{x}_{\cdot i} = \frac{\sum_{j=1}^{n} f_j w_j x_j\, I(y_j = i)}{\sum_{j=1}^{n} f_j w_j\, I(y_j = i)}$$


Sum of squared errors of X for each Y=i, i=1,...,J:

$$M^2_{\cdot i} = \sum_{j=1}^{n} f_j w_j (x_j - \bar{x}_{\cdot i})^2\, I(y_j = i)$$


Sum of frequency weights for each Y=i, i=1,...,J:

$$N_{\cdot i} = \sum_{j=1}^{n} f_j\, I(y_j = i \wedge x_j \text{ is not missing})$$


Number of valid cases:

$$N_{XY} = \sum_{i=1}^{J} N_{\cdot i}$$


Sum of weights (frequency weight times analysis weight) for each Y=i, i=1,...,J:

$$W_{\cdot i} = \sum_{j=1}^{n} f_j w_j\, I(y_j = i \wedge x_j \text{ is not missing})$$

Continuous target and all categorical predictors 


For a continuous target Y and a categorical predictor X with values i=1,...,I, the bivariate statistics
include:


Mean of Y conditional upon X:

$$\bar{y}_{\cdot\cdot} = \frac{\sum_{i=1}^{I} \sum_{j=1}^{n} f_j w_j y_j\, I(x_j = i)}{\sum_{i=1}^{I} \sum_{j=1}^{n} f_j w_j\, I(x_j = i)}$$


Sum of squared errors of Y:

$$M^2_{\cdot\cdot} = \sum_{j=1}^{n} f_j w_j (y_j - \bar{y}_{\cdot\cdot})^2$$


Mean of Y for each X = i, i=1,...,I:

$$\bar{y}_{i\cdot} = \frac{\sum_{j=1}^{n} f_j w_j y_j\, I(x_j = i)}{\sum_{j=1}^{n} f_j w_j\, I(x_j = i)}$$


Sum of squared errors of Y for each X = i, i=1,...,I:

$$M^2_{i\cdot} = \sum_{j=1}^{n} f_j w_j (y_j - \bar{y}_{i\cdot})^2\, I(x_j = i)$$


Sum of frequency weights for X = i, i=1,...,I:

$$N_{i\cdot} = \sum_{j=1}^{n} f_j\, I(x_j = i \wedge y_j \text{ is not missing})$$


Sum of weights (frequency weight times analysis weight) for X = i, i=1,...,I:

$$W_{i\cdot} = \sum_{j=1}^{n} f_j w_j\, I(x_j = i \wedge y_j \text{ is not missing})$$


Categorical target and all categorical predictors 


For a categorical target Y with values j=1,...,J and a categorical predictor X with values i=1,...,I,
the bivariate statistics are:


Sum of frequency weights for each combination of $x_k = i$ and $y_k = j$:

$$N_{ij} = \sum_{k=1}^{n} f_k\, I(x_k = i \wedge y_k = j)$$


Sum of weights (frequency weight times analysis weight) for each combination of $x_k = i$ and
$y_k = j$:

$$W_{ij} = \sum_{k=1}^{n} f_k w_k\, I(x_k = i \wedge y_k = j)$$


Categorical Variable Handling 


In this step, we use univariate or bivariate statistics to handle categorical predictors. 


Reordering Categories 


For a nominal predictor, we rearrange categories from lowest to highest counts. If there is a tie on 
counts, then ties will be broken by ascending sort or lexical order of the data values. The new field 
values start with 0 as the least frequent category. Note that the new field will be numeric even if 
the original field is a string. For example, if a nominal field’s data values are “A”, “A”, “A”, “B”, 
“C”, “C”, then automated data preparation would recode “B” into 0, “C” into 1, and “A” into 2. 
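The recoding in this example can be reproduced with a short sketch. The function name `reorder_categories` is hypothetical; categories are ranked by count and ties are broken by the ascending order of the data values.

```python
from collections import Counter


def reorder_categories(values):
    """Map each category to its rank, with 0 for the least frequent category."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (counts[v], v))
    return {value: code for code, value in enumerate(ordered)}
```

Applied to the data values in the example above, this returns the mapping B to 0, C to 1 and A to 2.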


Identify Highly Associated Categorical Features


If there is a target in the data set, we select an ordinal/nominal predictor if its p-value is not larger
than an alpha-level $\alpha_{selection}$ (default is 0.05). See “P-value Calculations” for details of
computing these p-values.


Since we use pairwise deletion to handle missing values when we collect bivariate statistics,
we may have some categories with zero cases; that is, $N_{i\cdot} = 0$ for a category i of a categorical
predictor. When we calculate p-values, these categories will be excluded.


If there is only one category or no category after excluding categories with zero cases, we set the 
p-value to be 1 and this predictor will not be selected. 


Supervised Merge 


We merge categories of an ordinal/nominal predictor using a supervised method that is similar to a
CHAID tree with one level of depth.


1. Exclude all categories with zero case count.

2. If X has 0 categories, merge all excluded categories into one category, then stop.

3. If X has 1 category, go to step 7.


4. Else, find the allowable pair of categories of X that is most similar. This is the pair whose test 
statistic gives the largest p-value with respect to the target. An allowable pair of categories for an 
ordinal predictor is two adjacent categories; for a nominal predictor it is any two categories. Note 
that for an ordinal predictor, if categories between the ith category and jth categories are excluded 
because of zero cases, then the ith category and jth categories are two adjacent categories. See 
“P-value Calculations” for details of computing these p-values. 


5. For the pair having the largest p-value, check if its p-value is larger than a specified alpha-level
$\alpha_{selection}$ (default is 0.05). If it is, this pair is merged into a single compound category and
at the same time we calculate the bivariate statistics of this new category. Then a new set of
categories of X is formed. If it is not, then go to step 7.


6. Go to step 3.


7. For an ordinal predictor, find the maximum value in each new category. Sort these maximum
values in ascending order. Suppose we have r new categories, and the maximum values are
$i_1 < i_2 < \cdots < i_r$; then we get the merge rule as: the first new category will contain all original
categories such that $X \le i_1$, the second new category will contain all original categories such that
$i_1 < X \le i_2$, ..., and the last new category will contain all original categories such that $X > i_{r-1}$.


For a nominal predictor, all categories excluded at step 1 will be merged into the new category 
with the lowest count. If there are ties on categories with the lowest counts, then ties are broken 
by selecting the category with the smallest value by ascending sort or lexical order of the original 
category values which formed the new categories with the lowest counts. 


Bivariate statistics calculation of new category


When two categories are merged into a new category, we need to calculate the bivariate statistics 
of this new category. 


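Scale target. If the categories $i$ and $i'$ can be merged based on p-value, then the bivariate statistics
should be calculated as:

$$N_{i,i'\cdot} = N_{i\cdot} + N_{i'\cdot}$$

$$W_{i,i'\cdot} = W_{i\cdot} + W_{i'\cdot}$$

$$\bar{y}_{i,i'\cdot} = \bar{y}_{i\cdot} + \frac{W_{i'\cdot}}{W_{i,i'\cdot}} \left( \bar{y}_{i'\cdot} - \bar{y}_{i\cdot} \right)$$

$$M^2_{i,i'\cdot} = M^2_{i\cdot} + M^2_{i'\cdot} + W_{i\cdot} \left( \bar{y}_{i\cdot} - \bar{y}_{i,i'\cdot} \right)^2 + W_{i'\cdot} \left( \bar{y}_{i'\cdot} - \bar{y}_{i,i'\cdot} \right)^2$$


Categorical target. If the categories $i$ and $i'$ can be merged based on p-value, then the bivariate
statistics should be calculated as:

$$N_{i,i'j} = N_{ij} + N_{i'j}$$

$$W_{i,i'j} = W_{ij} + W_{i'j}$$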


Update univariate and bivariate statistics 


At the end of the supervised merge step, we calculate the bivariate statistics for each new category. 
For univariate statistics, the count for each new category is the sum of the counts of the
original categories that formed the new category. Then we update other statistics according to
the formulas in “Univariate Statistics Collection”, though note that the statistics only need to be 
updated based on the new categories and the numbers of cases in these categories. 


P-value Calculations 


Each p-value calculation is based on the appropriate statistical test of association between the 
predictor and target. 


Scale target 


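We calculate an F statistic:

$$F = \frac{\sum_{i=1}^{I} W_{i\cdot} \left( \bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot} \right)^2 / (I - 1)}{\sum_{i=1}^{I} M^2_{i\cdot} \Big/ \left( \sum_{i=1}^{I} N_{i\cdot} - I \right)}$$

where $\bar{y}_{\cdot\cdot} = \dfrac{\sum_{i=1}^{I} W_{i\cdot}\, \bar{y}_{i\cdot}}{\sum_{i=1}^{I} W_{i\cdot}}$.


Based on the F statistic, the p-value can be derived as

$$p = \Pr\!\left( F\!\left( I - 1,\ \sum_{i=1}^{I} N_{i\cdot} - I \right) > F \right)$$

where $F\!\left( I - 1,\ \sum_{i=1}^{I} N_{i\cdot} - I \right)$ is a random variable following an F distribution with $I - 1$ and
$\sum_{i=1}^{I} N_{i\cdot} - I$ degrees of freedom.


At the merge step we calculate the F statistic and p-value between two categories $i$ and $i'$ of X as

$$F = \frac{W_{i\cdot} \left( \bar{y}_{i\cdot} - \bar{y}_{i,i'\cdot} \right)^2 + W_{i'\cdot} \left( \bar{y}_{i'\cdot} - \bar{y}_{i,i'\cdot} \right)^2}{\left( M^2_{i\cdot} + M^2_{i'\cdot} \right) / \left( N_{i\cdot} + N_{i'\cdot} - 2 \right)}$$

$$p = \Pr\!\left( F(1,\ N_{i\cdot} + N_{i'\cdot} - 2) > F \right)$$

where $\bar{y}_{i,i'\cdot}$ is the mean of Y for a new category $i,i'$ merged by $i$ and $i'$:

$$\bar{y}_{i,i'\cdot} = \frac{W_{i\cdot}\, \bar{y}_{i\cdot} + W_{i'\cdot}\, \bar{y}_{i'\cdot}}{W_{i\cdot} + W_{i'\cdot}}$$

and $F(1,\ N_{i\cdot} + N_{i'\cdot} - 2)$ is a random variable following an F distribution with 1 and
$N_{i\cdot} + N_{i'\cdot} - 2$ degrees of freedom.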


Nominal target 


The null hypothesis of independence of X and Y is tested. First a contingency (or count) table is
formed using classes of Y as columns and categories of the predictor X as rows. Then the expected
cell frequencies under the null hypothesis are estimated. The observed cell frequencies and the
expected cell frequencies are used to calculate the Pearson chi-squared statistic and the p-value:

$$X^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{\left( N_{ij} - \hat{m}_{ij} \right)^2}{\hat{m}_{ij}}$$

where $N_{ij} = \sum_{k=1}^{n} f_k\, I(x_k = i \wedge y_k = j)$ is the observed cell frequency and $\hat{m}_{ij}$ is the estimated
expected cell frequency for cell $(x_k = i,\ y_k = j)$ following the independence model. If $\hat{m}_{ij} = 0$,
then the corresponding term $\left( N_{ij} - \hat{m}_{ij} \right)^2 / \hat{m}_{ij}$ is set to 0. How to estimate $\hat{m}_{ij}$ is described below.

The corresponding p-value is given by $p = \Pr\!\left( \chi^2_d > X^2 \right)$, where $\chi^2_d$ follows a chi-squared
distribution with $d = (I - 1)(J - 1)$ degrees of freedom.


When we investigate whether two categories $i$ and $i'$ of X can be merged, the Pearson chi-squared
statistic is revised as

$$X^2 = \sum_{j=1}^{J} \sum_{k \in \{i, i'\}} \frac{\left( N_{kj} - \hat{m}_{kj} \right)^2}{\hat{m}_{kj}}$$

and the p-value is given by $p = \Pr\!\left( \chi^2_{J-1} > X^2 \right)$.


Ordinal target


Suppose there are I categories of X, and J ordinal categories of Y. Then the null hypothesis of
the independence of X and Y is tested against the row effects model (with the rows being the
categories of X and columns the classes of Y) proposed by Goodman (1979). Two sets of expected
cell frequencies, $\hat{m}_{ij}$ (under the hypothesis of independence) and $\tilde{m}_{ij}$ (under the hypothesis that
the data follow a row effects model), are both estimated. The likelihood ratio statistic is

$$H^2 = 2 \sum_{i=1}^{I} \sum_{j=1}^{J} H^2_{ij}, \qquad H^2_{ij} = \begin{cases} \tilde{m}_{ij} \ln\!\left( \tilde{m}_{ij} / \hat{m}_{ij} \right) & \tilde{m}_{ij} / \hat{m}_{ij} > 0 \\ 0 & \text{else} \end{cases}$$

The p-value is given by $p = \Pr\!\left( \chi^2_{I-1} > H^2 \right)$.


Estimated expected cell frequencies (independence assumption) 


If analysis weights are specified, the expected cell frequency under the null hypothesis of
independence is of the form

$$\hat{m}_{ij} = \bar{w}_{ij}^{-1} \alpha_i \beta_j$$

where $\alpha_i$ and $\beta_j$ are parameters to be estimated, and $\bar{w}_{ij} = W_{ij} / N_{ij}$ if $N_{ij} > 0$, otherwise $\bar{w}_{ij} = 1$.


Parameter estimates $\hat{\alpha}_i$, $\hat{\beta}_j$, and hence $\hat{m}_{ij}$, are obtained from the following iterative procedure.


1. $k = 0$, $\alpha_i^{(0)} = \beta_j^{(0)} = 1$, $m_{ij}^{(0)} = \bar{w}_{ij}^{-1}$

2. $\alpha_i^{(k+1)} = \dfrac{N_{i\cdot}}{\sum_j \bar{w}_{ij}^{-1} \beta_j^{(k)}} = \alpha_i^{(k)} \dfrac{N_{i\cdot}}{\sum_j m_{ij}^{(k)}}$

3. $\beta_j^{(k+1)} = \dfrac{N_{\cdot j}}{\sum_i \bar{w}_{ij}^{-1} \alpha_i^{(k+1)}}$

4. $m_{ij}^{(k+1)} = \bar{w}_{ij}^{-1} \alpha_i^{(k+1)} \beta_j^{(k+1)}$

5. If $\max_{i,j} \left| m_{ij}^{(k+1)} - m_{ij}^{(k)} \right| < \epsilon$ (default is 0.001) or the number of iterations is larger than a
threshold (default is 100), stop and output $\alpha_i^{(k+1)}$, $\beta_j^{(k+1)}$ and $m_{ij}^{(k+1)}$ as the final estimates
$\hat{\alpha}_i$, $\hat{\beta}_j$, $\hat{m}_{ij}$. Otherwise, $k = k + 1$ and go to step 2.
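The iterative procedure for the independence model can be sketched as an iterative proportional fitting loop. This is one reading of the steps under stated assumptions: the function name `expected_independence` is hypothetical, the tables are lists of rows, and every row and column is assumed to contain at least one case (categories with zero cases are excluded beforehand).

```python
def expected_independence(n_table, w_table, eps=1e-3, max_iter=100):
    """Estimate expected cell frequencies m_ij = alpha_i * beta_j / wbar_ij.

    n_table: observed frequency-weight sums N_ij;
    w_table: sums of frequency weight times analysis weight W_ij.
    """
    rows, cols = len(n_table), len(n_table[0])
    wbar = [[w_table[i][j] / n_table[i][j] if n_table[i][j] > 0 else 1.0
             for j in range(cols)] for i in range(rows)]
    row_tot = [sum(n_table[i]) for i in range(rows)]
    col_tot = [sum(n_table[i][j] for i in range(rows)) for j in range(cols)]
    beta = [1.0] * cols
    m = [[1.0 / wbar[i][j] for j in range(cols)] for i in range(rows)]
    for _ in range(max_iter):
        alpha = [row_tot[i] / sum(beta[j] / wbar[i][j] for j in range(cols))
                 for i in range(rows)]
        beta = [col_tot[j] / sum(alpha[i] / wbar[i][j] for i in range(rows))
                for j in range(cols)]
        new_m = [[alpha[i] * beta[j] / wbar[i][j] for j in range(cols)]
                 for i in range(rows)]
        change = max(abs(new_m[i][j] - m[i][j])
                     for i in range(rows) for j in range(cols))
        m = new_m
        if change < eps:
            break
    return m
```

When all analysis weights equal 1, the loop converges after two passes to the familiar estimate, row total times column total divided by the grand total.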


Estimated expected cell frequencies (row effects model) 


In the row effects model, scores for classes of Y are needed. By default, $s_j^*$ (the order of a
class of Y) is used as the class score. These orders will be standardized via the following linear
transformation such that the largest score is 100 and the lowest score is 0.

$$s_j = 100 \left( s_j^* - s^*_{min} \right) / \left( s^*_{max} - s^*_{min} \right)$$

where $s^*_{min}$ and $s^*_{max}$ are the smallest and largest order, respectively.


The expected cell frequency under the row effects model is given by

$$\tilde{m}_{ij} = \bar{w}_{ij}^{-1} \alpha_i \beta_j \gamma_i^{(s_j - \bar{s})}$$

where $\bar{s} = \sum_{j=1}^{J} W_{\cdot j}\, s_j \Big/ \sum_{j=1}^{J} W_{\cdot j}$, in which $W_{\cdot j} = \sum_i W_{ij}$, and $\alpha_i$, $\beta_j$, and $\gamma_i$ are unknown
parameters to be estimated.


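Parameter estimates $\hat{\alpha}_i$, $\hat{\beta}_j$, $\hat{\gamma}_i$, and hence $\tilde{m}_{ij}$, are obtained from the following iterative procedure.


1. $k = 0$, $\alpha_i^{(0)} = \beta_j^{(0)} = \gamma_i^{(0)} = 1$, $m_{ij}^{(0)} = \bar{w}_{ij}^{-1}$

2. $\alpha_i^{(k+1)} = \dfrac{N_{i\cdot}}{\sum_j \bar{w}_{ij}^{-1} \beta_j^{(k)} \left( \gamma_i^{(k)} \right)^{(s_j - \bar{s})}} = \alpha_i^{(k)} \dfrac{N_{i\cdot}}{\sum_j m_{ij}^{(k)}}$

3. $\beta_j^{(k+1)} = \dfrac{N_{\cdot j}}{\sum_i \bar{w}_{ij}^{-1} \alpha_i^{(k+1)} \left( \gamma_i^{(k)} \right)^{(s_j - \bar{s})}}$

4. $m_{ij}^{*} = \bar{w}_{ij}^{-1} \alpha_i^{(k+1)} \beta_j^{(k+1)} \left( \gamma_i^{(k)} \right)^{(s_j - \bar{s})}$, $\quad G_i = 1 + \dfrac{\sum_j (s_j - \bar{s}) \left( N_{ij} - m_{ij}^{*} \right)}{\sum_j (s_j - \bar{s})^2\, m_{ij}^{*}}$

5. $\gamma_i^{(k+1)} = \gamma_i^{(k)} G_i$ if $G_i > 0$; otherwise $\gamma_i^{(k+1)} = \gamma_i^{(k)}$

6. $m_{ij}^{(k+1)} = \bar{w}_{ij}^{-1} \alpha_i^{(k+1)} \beta_j^{(k+1)} \left( \gamma_i^{(k+1)} \right)^{(s_j - \bar{s})}$

7. If $\max_{i,j} \left| m_{ij}^{(k+1)} - m_{ij}^{(k)} \right| < \epsilon$ (default is 0.001) or the number of iterations is larger than a
threshold (default is 100), stop and output $\alpha_i^{(k+1)}$, $\beta_j^{(k+1)}$, $\gamma_i^{(k+1)}$ and $m_{ij}^{(k+1)}$ as the final estimates
$\hat{\alpha}_i$, $\hat{\beta}_j$, $\hat{\gamma}_i$, $\tilde{m}_{ij}$. Otherwise, $k = k + 1$ and go to step 2.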


Unsupervised Merge 


If there is no target, we merge categories based on counts. Suppose that X has I categories which
are sorted in ascending order. For an ordinal predictor, we sort it according to its values, while
for a nominal predictor we rearrange categories from lowest to highest count, with ties broken
by ascending sort or lexical order of the data values. Let $c_i$ be the number of cases for the ith
category, and $N_X$ be the total number of cases for X. Then we use the equal frequency method
to merge sparse categories.


1. Start with $j_1 = j_2 = 1$ and $g = 1$.

2. If $j_2 > I$, go to step 5.


3. If $\sum_{j=j_1}^{j_2} c_j < [b\% \times N_X]$, then $j_2 = j_2 + 1$; otherwise the original categories $j_1, j_1 + 1, \ldots, j_2$ will
be merged into the new category g and let $j_1 = j_2 + 1$, $j_2 = j_1$ and $g = g + 1$, then go to step 2.


4. If $j_2 > I$, then merge categories using one of the following rules:

i) If $g = 1$, then categories $1, 2, \ldots, I - 1$ will be merged into category g and I will be left
unmerged.

ii) If $g = 2$, then $j_1, j_1 + 1, \ldots, I$ will be merged into category $g = 2$.

iii) If $g > 2$, then $j_1, j_1 + 1, \ldots, I$ will be merged into category $g - 1$.

If $j_2 \le I$, then go to step 3.


5. Output the merge rule and merged predictor. 


After merging, one of the following rules holds: 


■ Neither the original category nor any category created during merging has fewer than
$[b\% \times N_X]$ cases, where b is a user-specified parameter satisfying $1 < b < 100$ (default is
10) and $[x]$ denotes the nearest integer of x.


m The merged predictor has only two categories. 


Update univariate statistics. When original categories $j_1, j_1 + 1, \ldots, j_2$ are merged into one new
category, then the number of cases in this new category will be $\sum_{j=j_1}^{j_2} c_j$. At the end of the
merge step, we get new categories and the number of cases in each category. Then we update
other statistics according to the formulas in “Univariate Statistics Collection”, though note that
the statistics only need to be updated based on the new categories and the numbers
of cases in these categories.
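The equal frequency merge can be sketched as a single scan over the sorted category counts. This is an interpretation of the steps, not a definitive implementation: the function name `equal_frequency_merge` is hypothetical, positions are 0-based, and the nearest integer is taken by rounding halves up (an assumption, since the tie rule is not stated).

```python
import math


def equal_frequency_merge(counts, b=10.0):
    """Group sorted categories so that each group reaches [b% x N_X] cases.

    counts: case counts of the categories, already in the required order.
    Returns a list of groups, each a list of original category positions.
    """
    total = sum(counts)
    n_cat = len(counts)
    threshold = math.floor(b / 100.0 * total + 0.5)
    groups = []
    start = 0
    running = 0
    for pos in range(n_cat):
        running += counts[pos]
        if running >= threshold:            # group is large enough: close it
            groups.append(list(range(start, pos + 1)))
            start = pos + 1
            running = 0
    if start < n_cat:                       # trailing categories below threshold
        leftover = list(range(start, n_cat))
        if not groups:                      # g = 1
            groups = [list(range(n_cat - 1)), [n_cat - 1]]
        elif len(groups) == 1:              # g = 2
            groups.append(leftover)
        else:                               # g > 2: join the previous group
            groups[-1].extend(leftover)
    return [group for group in groups if group]
```

For instance, with counts `[10, 10, 1]` and the default b of 10, the threshold is 2 cases, so the sparse last category is absorbed into the group before it.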


Continuous Predictor Handling 


Continuous predictor handling includes supervised binning when the target is categorical, 
predictor selection when the target is continuous and predictor construction when the target is 
continuous or there is no target in the dataset. 


After handling continuous predictors, we collect univariate statistics for derived or constructed 
predictors according to the formulas in “Univariate Statistics Collection”. Any derived predictors 
that are constant, or have all missing values, are excluded from further analysis. 




Supervised Binning 


If there is a categorical target, then we will transform each continuous predictor to an ordinal
predictor using supervised binning. Suppose that we have already collected the bivariate statistics
between the categorical target and a continuous predictor. Using the notations introduced in
“Bivariate Statistics Collection”, the homogeneous subset will be identified by the Scheffe
method as follows:


If $\left| \bar{x}_{\cdot i} - \bar{x}_{\cdot j} \right| \le c_{critical}$, then $\bar{x}_{\cdot i}$ and $\bar{x}_{\cdot j}$ will be a homogeneous subset, where
$c_{critical} = \max(\bar{x}_{\cdot i}) - \min(\bar{x}_{\cdot j})$ if $N_{XY} = J$; otherwise $c_{critical} = R \times C$, where

$$R = \sqrt{2 (J - 1)\, F_{1-\alpha}(J - 1,\ N_{XY} - J)}, \qquad C = \sqrt{MS \times \frac{\sum_{i=1}^{J} 1 / N_{\cdot i}}{J}}, \qquad MS = \frac{\sum_{i=1}^{J} M^2_{\cdot i}}{N_{XY} - J}$$


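The supervised algorithm follows:

1. Sort the means $\bar{x}_{\cdot i}$ in ascending order, denoted as $\bar{x}_{(1)} \le \bar{x}_{(2)} \le \cdots \le \bar{x}_{(J)}$.

2. Start with i=1 and q=J.


3. If $\left| \bar{x}_{(q)} - \bar{x}_{(i)} \right| \le c_{critical}$, then $\{ \bar{x}_{(i)}, \ldots, \bar{x}_{(q)} \}$ can be considered a homogeneous subset. At the
same time we compute the mean and standard deviation of this subset:

$$\bar{x}_{(i,q)} = \frac{\sum_{k=i}^{q} W_{(k)}\, \bar{x}_{(k)}}{\sum_{k=i}^{q} W_{(k)}} \qquad \text{and} \qquad sd_{(i,q)} = \sqrt{\frac{M^2_{(i,q)}}{\sum_{k=i}^{q} N_{(k)} - 1}}$$

where $M^2_{(i,q)} = \sum_{k=i}^{q} A_{(k)}$ and $A_{(k)} = M^2_{(k)} + W_{(k)} \left( \bar{x}_{(i,q)} - \bar{x}_{(k)} \right)^2$;
then set i = q + 1 and q = J. Otherwise q = q - 1.

4. If $i \le J$, go to step 3.


5. Else compute the cut points of the bins. Suppose we have $r \le J$ homogeneous subsets and we
assume that the means of these subsets are $\bar{x}^*_{(1)}, \bar{x}^*_{(2)}, \ldots, \bar{x}^*_{(r)}$, and the standard deviations are
$sd^*_{(1)}, sd^*_{(2)}, \ldots, sd^*_{(r)}$; then the cut points between the ith and (i+1)th homogeneous subsets are
computed as

$$cut_i = \bar{x}^*_{(i)} + \frac{sd^*_{(i)}}{sd^*_{(i)} + sd^*_{(i+1)}} \left( \bar{x}^*_{(i+1)} - \bar{x}^*_{(i)} \right)$$


6. Output the binning rules. Category 1: $X \le cut_1$; Category 2: $cut_1 < X \le cut_2$; ...; Category
r: $cut_{r-1} < X$.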


Feature Selection and Construction 


If there is a continuous target, we perform predictor selection using p-values derived from the
correlation or partial correlation between the predictors and the target. The selected predictors are
grouped if they are highly correlated. In each group, we will derive a new predictor using principal
component analysis. However, if there is no target, we do not perform predictor selection.


To identify highly correlated predictors, we compute the correlation between a scale predictor and a group as
follows: suppose that X is a continuous predictor and continuous predictors $X_1, X_2, \ldots, X_m$ form
a group G. Then the correlation between X and group G is defined as:

$$r_{X,G} = \min\left\{ \left| r_{X X_i} \right| : X_i \in G \right\}$$

where $r_{X X_i}$ is the correlation between X and $X_i$.
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Let @4;0%) be the correlation level at which the predictors are identified as groups. The predictor 
selection and predictor construction algorithm is as follows: 


1. (Target is continuous and predictor selection is in effect) If the p-value between a continuous
predictor and the target is larger than a threshold (default is 0.05), then we remove this predictor from
the correlation matrix and covariance matrix. See “Correlation and Partial Correlation”
for details on computing these p-values.

2. Start with $\alpha_{group} = 0.9$ and i=1.

3. If $\alpha_{group} < 0.1$, stop and output all the derived predictors, their source predictors, and the coefficient
of each source predictor. In addition, output the remaining predictors in the correlation matrix.

4. Find the two most correlated predictors such that their correlation in absolute value is larger than
$\alpha_{group}$, and put them in group i. If there are no predictors to be chosen, then go to step 9.

5. Add one predictor to group i such that the predictor is most correlated with group i and the
correlation is larger than $\alpha_{group}$. Repeat this step until the number of predictors in group i is
greater than a threshold (default is 5) or there is no predictor to be chosen.


6. Derive a new predictor from the group i using principal component analysis. For more 
information, see the topic “Principal Component Analysis”. 


7. (Both predictor selection and predictor construction are in effect) Compute partial correlations 
between the other continuous predictors and the target, controlling for values of the new predictor. 
Also compute the p-values based on partial correlation. See “Correlation and Partial Correlation” 
for details on computing these p-values. If the p-value based on partial correlation between a 
continuous predictor and continuous target is larger than a threshold (default is 0.05), then 
remove this predictor from the correlation and covariance matrices. 


8. Remove predictors that are in the group from the correlation matrix. Then let i=i+1 and go to 
step 4. 


9. Set $\alpha_{group} = \alpha_{group} - 0.1$, then go to step 3.


Notes: 


■ If only predictor selection is needed, then only step 1 is implemented. If only predictor
construction is needed, then we implement all steps except step 1 and step 7. If both predictor
selection and predictor construction are needed, then all steps are implemented.

■ If there are ties on correlations when we identify highly correlated predictors, the ties will be
broken by selecting the predictor with the smallest index in the dataset.
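The grouping part of the algorithm (steps 2 to 5, 8 and 9) can be sketched as below. This is a minimal illustration, not the product implementation: `group_predictors` is a hypothetical name, the correlation matrix is a plain list of lists, and the p-value filtering and the principal component step (steps 1, 6 and 7) are left out.

```python
def group_predictors(corr, max_size=5):
    """Greedy grouping of predictors on a correlation matrix (list of lists)."""
    remaining = list(range(len(corr)))
    groups = []

    def group_corr(x, group):
        # Correlation between a predictor and a group: the minimum absolute correlation.
        return min(abs(corr[x][g]) for g in group)

    for tenths in range(9, 0, -1):            # alpha_group = 0.9, 0.8, ..., 0.1
        alpha = tenths / 10
        while True:
            best = None                       # step 4: most correlated pair above alpha
            for pos, a in enumerate(remaining):
                for b in remaining[pos + 1:]:
                    r = abs(corr[a][b])
                    if r > alpha and (best is None or r > best[0]):
                        best = (r, a, b)      # strict '>' keeps the smallest indices on ties
            if best is None:
                break                         # step 9: lower alpha_group
            group = [best[1], best[2]]
            # Step 5, read literally: stop once the group size exceeds the threshold.
            while len(group) <= max_size:
                cands = [(group_corr(x, group), x) for x in remaining if x not in group]
                cands = [c for c in cands if c[0] > alpha]
                if not cands:
                    break
                top = max(c[0] for c in cands)
                group.append(min(x for r, x in cands if r == top))
            groups.append(sorted(group))
            remaining = [x for x in remaining if x not in group]   # step 8
    return groups, remaining
```

Iterating over integer tenths avoids floating-point drift when $\alpha_{group}$ is lowered repeatedly by 0.1.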


Principal Component Analysis 


Let $X_1, X_2, \ldots, X_m$ be m continuous predictors. Principal component analysis can be described
as follows:
1. Input $C_{m \times m}$, the covariance matrix of $X_1, X_2, \ldots, X_m$.

2. Calculate the eigenvectors and eigenvalues of the covariance matrix. Sort the eigenvalues (and
corresponding eigenvectors) in descending order, $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_m$.

3. Derive new predictors. Suppose the elements of the first component $v_1$ are $v_{11}, v_{12}, \ldots, v_{1m}$; then
the new derived predictor is

$$\frac{v_{11}}{\sqrt{\lambda_1}} X_1 + \frac{v_{12}}{\sqrt{\lambda_1}} X_2 + \cdots + \frac{v_{1m}}{\sqrt{\lambda_1}} X_m.$$
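As an illustration of step 3, the sketch below finds the leading eigenpair of a covariance matrix by power iteration and scales the eigenvector by $1/\sqrt{\lambda_1}$. Power iteration is a stand-in chosen to keep the example dependency-free; the source does not say which eigen-solver is used, and the function name is hypothetical.

```python
import math

def first_component_coefficients(cov, iterations=1000, tol=1e-12):
    """Return (lambda_1, coefficients v_1j / sqrt(lambda_1)) for a covariance matrix."""
    m = len(cov)
    v = [1.0 / math.sqrt(m)] * m          # unit-length starting vector
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("covariance matrix maps the start vector to zero")
        w = [x / norm for x in w]
        converged = max(abs(a - b) for a, b in zip(w, v)) < tol
        v, eigenvalue = w, norm           # ||C v|| approaches lambda_1 for unit v
        if converged:
            break
    return eigenvalue, [x / math.sqrt(eigenvalue) for x in v]
```

The sign of an eigenvector is arbitrary, so the coefficients are determined only up to a common sign.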


Correlation and Partial Correlation 


Correlation and P-value 


Let $r_{XY}$ be the correlation between continuous predictor X and continuous target Y; then the
p-value is derived from the t test:

$$p = \Pr\left(|T(N_{XY}-2)| > |t|\right)$$

where $T(N_{XY}-2)$ is a random variable with a t distribution with $N_{XY}-2$ degrees of freedom,
and $t = r_{XY}\sqrt{\dfrac{N_{XY}-2}{1-r_{XY}^2}}$. If $r_{XY}^2 = 1$, then set p=0; if $N_{XY} \le 2$, then set p=1.
Partial correlation and P-value 


For two continuous variables, X and Y, we can calculate the partial correlation between them
controlling for the values of a new continuous variable Z:

$$r_{XY|Z} = \frac{r_{XY} - r_{XZ}\,r_{YZ}}{\sqrt{(1-r_{XZ}^2)(1-r_{YZ}^2)}}$$

Since the new variable Z is always a linear combination of several continuous variables, we
compute the correlation of Z and a continuous variable using a property of the covariance rather
than the original dataset. Suppose the new derived predictor Z is a linear combination of original
predictors $X_1, X_2, \ldots, X_m$:

$$Z = a_1X_1 + a_2X_2 + \cdots + a_mX_m$$

Then for any continuous variable X (continuous predictor or continuous target), the correlation
between X and Z is

$$r_{XZ} = \frac{c_{XZ}}{\sqrt{c_{ZZ}\,c_{XX}}}$$

where $c_{XZ} = \sum_{i=1}^{m} a_i c_{XX_i}$ and $c_{ZZ} = \sum_{i=1}^{m} a_i^2 c_{X_iX_i} + \sum_{i \ne j} a_i a_j c_{X_iX_j}$.

If $1 - r_{XZ}^2$ or $1 - r_{YZ}^2$ is less than $10^{-10}$, let $r_{XY|Z} = 0$. If $r_{XY|Z}$ is larger than 1, then set it to
1; if $r_{XY|Z}$ is less than -1, then set it to -1. (This may occur with pairwise deletion.) Based on
partial correlation, the p-value is derived from the t test

$$p = \Pr\left(|T(N_{XY}-3)| > |t|\right)$$

where $T(N_{XY}-3)$ is a random variable with a t distribution with $N_{XY}-3$ degrees of freedom,
and $t = r_{XY|Z}\sqrt{\dfrac{N_{XY}-3}{1-r_{XY|Z}^2}}$. If $r_{XY|Z}^2 = 1$, then set p=0; if $N_{XY} \le 3$, then set p=1.
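The covariance-based correlation with a derived predictor and the guarded partial correlation can be sketched as follows. Function names are hypothetical, the tolerance `tol` is an assumed parameter, and the p-value itself (a t-distribution tail probability) is left to a statistics library.

```python
import math

def corr_with_combination(c_xx, c_x_xi, cov, coeffs):
    """r_XZ for Z = sum(a_i * X_i), computed from covariances only."""
    c_xz = sum(a * c for a, c in zip(coeffs, c_x_xi))
    m = len(coeffs)
    # Full quadratic form a' C a = diagonal terms + all off-diagonal (i != j) terms.
    c_zz = sum(coeffs[i] * coeffs[j] * cov[i][j] for i in range(m) for j in range(m))
    return c_xz / math.sqrt(c_zz * c_xx)

def partial_corr(r_xy, r_xz, r_yz, tol=1e-10):
    """Partial correlation of X and Y given Z, with the guards described in the text."""
    if 1 - r_xz ** 2 < tol or 1 - r_yz ** 2 < tol:
        return 0.0
    r = (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    return max(-1.0, min(1.0, r))      # clamp; values outside [-1, 1] can arise with pairwise deletion

def t_statistic(r, n, controls=0):
    """t = r * sqrt(df / (1 - r^2)) with df = n - 2 - (number of control variables)."""
    return r * math.sqrt((n - 2 - controls) / (1 - r ** 2))
```

Calling `t_statistic(r, n, controls=1)` gives the statistic for the partial-correlation test with $N_{XY}-3$ degrees of freedom.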
Discretization of Continuous Predictors


Discretization is used for calculating predictive power and creating histograms. 


Discretization for calculating predictive power 


If the transformed target is categorical, we use the equal width bins method to discretize a 
continuous predictor into a number of bins equal to the number of categories of the target. 
Variables considered for discretization include: 


■ Scale predictors which have been recommended.

■ Original continuous variables of recommended predictors.


Discretization for creating histograms 


We use the equal width bins method to discretize a continuous predictor into a maximum of 400 
bins. Variables considered for discretization include: 


■ Recommended continuous variables.

■ Excluded continuous variables which have not been used to derive a new variable.

■ Original continuous variables of recommended variables.

■ Original continuous variables of excluded variables which have not been used to derive a
new variable.

■ Scale variables used to construct new variables. If their original variables are also continuous,
then the original variables will be discretized.

■ Date/time variables.

After discretization, the number of cases and mean in each bin are collected to create histograms.
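A minimal sketch of equal-width binning that collects the count and mean per bin is shown below. The function name is hypothetical, and the edge convention (left-closed bins, maximum value placed in the last bin) is an assumption, since the text does not specify it.

```python
def equal_width_histogram(values, n_bins):
    """Return per-bin counts and means for equal-width bins over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    sums = [0.0] * n_bins
    for v in values:
        # Assumption: a constant variable falls entirely into the first bin.
        k = 0 if width == 0 else min(int((v - lo) / width), n_bins - 1)
        counts[k] += 1
        sums[k] += v
    means = [s / c if c else None for s, c in zip(sums, counts)]
    return counts, means
```

For histograms, `n_bins` would be capped at 400; for predictive power it would equal the number of target categories.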


Note: If an original predictor has been recast, then this recast version will be regarded as the 
“original” predictor. 


Predictive Power 


Collect bivariate statistics for predictive power 


We collect bivariate statistics between recommended predictors and the (transformed) target. If 
an original predictor of a recommended predictor exists, then we also collect bivariate statistics 
between this original predictor and the target; if an original predictor has a recast version, then 
we use the recast version. 


If the target is categorical, but a recommended predictor or its original predictor/recast version is 
continuous, then we discretize the continuous predictor using the method in “Discretization of 
Continuous Predictors” and collect bivariate statistics between the categorical target and the 
categorical predictors.


Bivariate statistics between the predictors and target are the same as those described in “Bivariate
Statistics Collection”. 


Computing predictive power 


Predictive power is used to measure the usefulness of a predictor and is computed with respect 
to the (transformed) target. If an original predictor of a recommended predictor exists, then we 
also compute predictive power for this original predictor; if an original predictor has a recast 
version, then we use the recast version. 


Scale target. When the target is continuous, we fit a linear regression model and predictive 
power is computed as follows. 


■ Scale predictor:

■ Categorical predictor: $1 - \dfrac{S_e}{S_T}$, where $S_e = \sum_{j=1}^{J} M_j^2$ and $S_T = \sum_{i=1}^{n} f_i w_i (y_i - \bar{y})^2$.


Categorical target. If the (transformed) target is categorical, then we fit a naive Bayes model 
and the classification accuracy will serve as predictive power. We discretize continuous 
predictors as described in “Discretization of Continuous Predictors”, so we only consider the 
predictive power of categorical predictors. 


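If $N_{ij}$ is the number of cases where X = i and Y = j, $N_{i\cdot} = \sum_{j=1}^{J} N_{ij}$, and $N_{\cdot j} = \sum_{i=1}^{I} N_{ij}$, then the chi-square statistic is

$$\chi^2 = \sum_{i=1}^{I}\sum_{j=1}^{J} \frac{(N_{ij} - \hat{N}_{ij})^2}{\hat{N}_{ij}}$$

where $\hat{N}_{ij} = \dfrac{N_{i\cdot}\,N_{\cdot j}}{N_{XY}}$,

and Cramer’s V is defined as

$$V = \left(\frac{\chi^2}{N_{XY}\,(\min(I,J) - 1)}\right)^{1/2}$$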
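Cramer's V can be computed directly from a table of counts $N_{ij}$. The sketch below assumes a rectangular table given as a list of rows; the function name is hypothetical, and cells with zero expected count are skipped, which is an assumption.

```python
import math

def cramers_v(table):
    """Cramer's V for a contingency table given as a list of rows of counts."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            if expected > 0:               # skip empty margins (assumption)
                chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))
```

A value of 0 indicates no association between the two categorical variables, and 1 indicates perfect association.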
Bayesian One-Way ANOVA Models


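Model 1 can be viewed as a special case of the general multiple linear regression model:

$$M_1: y = 1_n\alpha + X\beta + \epsilon, \tag{1}$$

where $y = (y_{11}, \ldots, y_{1n_1}, \ldots, y_{k1}, \ldots, y_{kn_k})^T$; $n = n_1 + \cdots + n_k$; $\alpha = \mu_k$; $\beta = (\mu_1 - \mu_k, \mu_2 - \mu_k, \ldots, \mu_{k-1} - \mu_k)^T$;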


and 


X=|: fo: |. (2) 
| nr re 


Note that $\epsilon \sim \text{Normal}(0, \sigma^2 I)$.

Let $f_i$ denote the frequency weight for the i-th case in the n observations. A non-integer $f_i$ is rounded
to the nearest integer. Cases whose weights are less than 0.5 or missing are not used.
The effective sample size in the data set is thus $N = \sum_{i=1}^{n} f_i$. If no weights are present, N = n. Note that the
sufficient sample size for estimation is N > k.


Using Bayes Factor 


Considering the multiple linear regression model $M_1$ in Equation (1), we would like to compare this full model
with a null model:

$$M_0: y = 1_n\alpha + \epsilon, \tag{3}$$

and test the null hypothesis $H_0: \beta = 0$. Note that $\alpha$ is a common parameter in both $M_0$ and $M_1$. We are
interested in making inference on $\beta$, but need to place appropriate priors on all of the unknown parameters,
including $\alpha$, $\beta$, and $\sigma^2$. In the following discussions, we let $\phi = 1/\sigma^2$, where $\phi$ denotes a precision parameter.

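Note: The following sections are the same as the sections in the “Bayesian Inference on Multiple Linear
Regression Models” document. The only difference is substituting p with k - 1. Note that if we define

$$W = \begin{pmatrix} f_1 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & f_n \end{pmatrix}, \tag{4}$$

then under the one-way ANOVA setting, we have:

$$X^TWX = \begin{pmatrix} n_{1,f} & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & n_{k,f} \end{pmatrix}, \tag{5}$$

where $n_{i,f} = \sum_{j=n_1+\cdots+n_{i-1}+1}^{n_1+\cdots+n_i} f_j$, for i = 1, ..., k.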


Zellner’s Method 
Zellner suggested a g prior, broadly discussed under $M_1$ [Zellner, 1986]:
• $p(\alpha, \phi \mid M_1) \propto 1/\phi$.

• $\beta \mid (\phi, g, M_1) \sim \text{Normal}\left(0, \dfrac{g}{\phi}(X^TWX)^{-1}\right)$, where g is fixed.

Since g is fixed, Zellner’s g prior is computationally efficient. Under these settings, the Bayes factor
suggested by Zellner between $M_1$ and $M_0$ has a closed form


$$\Delta_{10} = (1+g)^{(N-k)/2}\left[1 + g(1-R^2)\right]^{-(N-1)/2}, \tag{6}$$

where g > 0 is fixed and preset, and $R^2$ is the unadjusted proportion of variance accounted for by
the factor, which can be computed by applying the REGRESSION procedure to model (1).
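Because Equation (6) is a closed form, it can be evaluated directly. The sketch below works on the log scale to avoid overflow for large N; the function name and argument names are hypothetical.

```python
import math

def zellner_log_bf10(r2, n, k, g):
    """log of the g-prior Bayes factor for M1 versus M0 (Equation 6).

    r2: unadjusted R^2, n: effective sample size N, k: number of groups, g: fixed g > 0.
    """
    return 0.5 * (n - k) * math.log1p(g) - 0.5 * (n - 1) * math.log1p(g * (1 - r2))
```

When $R^2 = 0$ the expression reduces to $-(k-1)/2 \cdot \log(1+g)$, so the null model is favored; as $R^2$ approaches 1 the Bayes factor grows in favor of $M_1$.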
Zellner-Siow’s Method
Zellner and Siow proposed a Cauchy prior [Zellner and Siow, 1980], which can be represented as a mixture of
g priors with an Inverse-Gamma(1/2, N/2) prior on g. Under these settings, the Bayes factor suggested by
Zellner and Siow between $M_1$ and $M_0$ is

$$\Delta_{10} = \int_0^\infty (1+g)^{(N-k)/2}\left[1+g(1-R^2)\right]^{-(N-1)/2}\,\frac{\sqrt{N/2}}{\Gamma(1/2)}\,g^{-3/2}e^{-N/(2g)}\,dg, \tag{7}$$

where $\Gamma(1/2) = \sqrt{\pi}$, and $R^2$ is defined the same as in the “Zellner’s Method” section.


Hyper-g Method 
Liang et al. introduced a family of priors on g by specifying

$$\pi(g) = \frac{a-2}{2}(1+g)^{-a/2}, \tag{8}$$

where g > 0 and a > 2 for a proper distribution [Liang et al., 2012]. Under these settings, the Bayes factor
suggested by Liang et al. between $M_1$ and $M_0$ is

$$\Delta_{10}(a) = \frac{a-2}{2}\int_0^\infty (1+g)^{(N-k-a)/2}\left[1+g(1-R^2)\right]^{-(N-1)/2}\,dg, \tag{9}$$

where a is preset, and $R^2$ is defined the same as in the “Zellner’s Method” section.


Rouder’s Method 
Under these settings, the Bayes factor suggested by Rouder and Morey between $M_1$ and $M_0$ is

$$\Delta_{10}(s) = \int_0^\infty (1+g)^{(N-k)/2}\left[1+g(1-R^2)\right]^{-(N-1)/2}\,\frac{s\sqrt{N/2}}{\Gamma(1/2)}\,g^{-3/2}e^{-Ns^2/(2g)}\,dg, \tag{11}$$

where s > 0, $\Gamma(1/2) = \sqrt{\pi}$, and $R^2$ is defined the same as in the “Zellner’s Method” section.


Characterizing Posterior Distributions 


Model 1 can also be viewed as another special case of the general multiple linear regression
model:

$$y = X\beta + \epsilon, \tag{12}$$

where $y = (y_{11}, \ldots, y_{1n_1}, \ldots, y_{k1}, \ldots, y_{kn_k})^T$; $n = n_1 + \cdots + n_k$; $\beta = (\mu_1, \mu_2, \ldots, \mu_k)^T$; and

$$X = \begin{pmatrix} 1_{n_1} & 0_{n_1} & \cdots & 0_{n_1} \\ \vdots & \vdots & \ddots & \vdots \\ 0_{n_k} & 0_{n_k} & \cdots & 1_{n_k} \end{pmatrix}. \tag{13}$$

Note that $\epsilon \sim \text{Normal}(0, \sigma^2 I)$. In the following presentation, we mainly discuss how to
make statistical inference on $\beta$ and $\sigma^2$ by using Bayesian approaches.
We define the frequency weight $f_i$ and the matrix W the same way as in the “Using Bayes Factor”
section. Thus, we still have:

$$X^TWX = \begin{pmatrix} n_{1,f} & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & n_{k,f} \end{pmatrix}, \tag{14}$$

where $n_{i,f} = \sum_{j=n_1+\cdots+n_{i-1}+1}^{n_1+\cdots+n_i} f_j$, for i = 1, ..., k.

Using Conjugate Prior
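We place a conjugate prior by assuming that
• $\sigma^2 \sim \text{Inverse-Gamma}(a_0, b_0)$,

• $\beta \mid \sigma^2 \sim \text{Normal}(\beta_0, \sigma^2 V_0)$.

Group means $\beta$: Under the setting of model (12), $\beta = (\mu_1, \mu_2, \ldots, \mu_k)^T$ represents the means of the k
groups corresponding to the k categories of the factor. With the conjugate prior, the resulting marginal
posterior distribution of $\beta \mid X, y$ follows a scaled multivariate t distribution with $\nu$ degrees of freedom, where
$\nu = 2a_0 + N$.

Before finding the Bayes estimator of $\beta$, we define the following quantities:

$$\beta_1 = \left(V_0^{-1} + X^TWX\right)^{-1}\left(V_0^{-1}\beta_0 + X^TWy\right), \tag{15}$$
$$V_1 = \left(V_0^{-1} + X^TWX\right)^{-1}, \tag{16}$$
$$a_1 = a_0 + \frac{N}{2}, \tag{17}$$
$$b_1 = b_0 + \frac{1}{2}\left(\beta_0^TV_0^{-1}\beta_0 + y^TWy - \beta_1^TV_1^{-1}\beta_1\right). \tag{18}$$

Hence, assuming $\nu > 4$, the mode and posterior mean of the group means are both

$$\hat{\beta} = E(\beta \mid X, y) = \beta_1, \tag{19}$$

and the variance-covariance matrix is

$$C(\beta \mid X, y) = \frac{b_1}{a_1 - 1}V_1, \tag{20}$$

where $V_1$, $a_1$, and $b_1$ are defined by Equations (16)-(18), and the diagonal elements are the variances of the
elements in $\beta = (\mu_1, \mu_2, \ldots, \mu_k)^T$. Define

$$B^* = \begin{pmatrix} B^*_{11} & B^*_{12} & \cdots & B^*_{1k} \\ B^*_{21} & B^*_{22} & \cdots & B^*_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ B^*_{k1} & B^*_{k2} & \cdots & B^*_{kk} \end{pmatrix} = \frac{b_1}{a_1}V_1. \tag{21}$$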

We may also find a 100(1 - c)% equal-tailed Bayesian credible interval covering $\mu_i$ such that

$$\mu_i \in \left(\hat{\mu}_i - \text{IDF.T}\left(1 - \tfrac{c}{2}, \nu\right)\sqrt{B^*_{ii}},\ \ \hat{\mu}_i + \text{IDF.T}\left(1 - \tfrac{c}{2}, \nu\right)\sqrt{B^*_{ii}}\right) \tag{22}$$

with probability 1 - c, where c = 0.05 by default, i = 1, 2, ..., k, $\hat{\mu}_i$ is the ith element in $\hat{\beta} = E(\beta \mid X, y)$,
$B^*_{ii}$ is the ith element on the diagonal of $B^*$, and IDF.T(·) is the IBM® SPSS® Statistics function for the
inverse cumulative t distribution.


Variance of error terms $\sigma^2$: Under the setting of conjugate priors, the marginal posterior distribution
of $\sigma^2$ is

$$\sigma^2 \mid X, y \sim \text{Inverse-Gamma}(a_1, b_1), \tag{23}$$

where $a_1$ and $b_1$ are defined by Equations (17) and (18), respectively.
We may find the mode and posterior estimators of $\sigma^2$ by computing the expected value

$$\hat{\sigma}^2 = E(\sigma^2 \mid X, y) = \frac{b_1}{a_1 - 1}, \tag{24}$$

for $a_1 > 1$, and the variance of the marginal posterior distribution of $\sigma^2 \mid X, y$

$$C(\sigma^2 \mid X, y) = \frac{b_1^2}{(a_1 - 1)^2(a_1 - 2)}, \tag{25}$$

for $a_1 > 2$. We may also find a 100(1 - c)% equal-tailed Bayesian credible interval covering $\sigma^2$ such that

$$\sigma^2 \in \left(\text{IDF.GAMMA}^{-1}\left(1 - \tfrac{c}{2}, a_1, b_1\right),\ \ \text{IDF.GAMMA}^{-1}\left(\tfrac{c}{2}, a_1, b_1\right)\right) \tag{26}$$

with probability 1 - c, where c = 0.05 by default, and IDF.GAMMA(·) is the IBM® SPSS® Statistics
function for the inverse cumulative Gamma distribution.


Using Standard Noninformative Prior 


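By setting $V_0^{-1} \to 0$, $a_0 = -k/2$, and $b_0 = 0$, we place a reference (non-informative) prior
by assuming that

$$p(\beta, \sigma^2) \propto 1/\sigma^2. \tag{27}$$

Group means $\beta$: Under the setting of Equation (27), the resulting marginal posterior distribution of $\beta \mid X, y$
follows a scaled multivariate t distribution with $\nu = N - k$ degrees of freedom. We can also find the mode
and posterior estimators of $\beta$ by computing the expected value

$$\hat{\beta} = E(\beta \mid X, y) = (X^TWX)^{-1}X^TWy = (\bar{y}_1, \ldots, \bar{y}_k)^T, \tag{28}$$

where $\bar{y}_i = \dfrac{\sum_{j} f_j y_j}{\sum_{j} f_j}$ with both sums running over $j = n_1+\cdots+n_{i-1}+1, \ldots, n_1+\cdots+n_i$, and the variance-covariance matrix

$$C(\beta \mid X, y) = \frac{\nu}{\nu-2}\,s^2(X^TWX)^{-1} = \frac{\nu s^2}{\nu-2}\begin{pmatrix} \frac{1}{n_{1,f}} & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{n_{k,f}} \end{pmatrix}, \tag{29}$$

where

$$s^2 = \frac{1}{\nu}(y - X\hat{\beta})^TW(y - X\hat{\beta}) = \frac{1}{\nu}\sum_{i=1}^{k}\ \sum_{j=n_1+\cdots+n_{i-1}+1}^{n_1+\cdots+n_i} f_j(y_j - \bar{y}_i)^2 = \frac{1}{\nu}\left[\sum_{i=1}^{k}\ \sum_{j=n_1+\cdots+n_{i-1}+1}^{n_1+\cdots+n_i} f_j y_j^2 - \sum_{i=1}^{k} n_{i,f}\,\bar{y}_i^2\right], \tag{30}$$

and the diagonal elements are the variances of the elements in $\beta = (\mu_1, \mu_2, \ldots, \mu_k)^T$.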
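Equations (28) to (30) reduce to weighted group means and a pooled variance, which can be sketched as below. The data layout (one list of `(y, f)` pairs per group, with `f` the already-rounded frequency weight) and the function name are assumptions made for the example.

```python
def reference_posterior(groups):
    """Posterior summaries of group means under the reference prior.

    groups: list of groups, each a list of (y, f) pairs (value, frequency weight).
    Returns (group means, s^2, nu, posterior variances of the group means).
    """
    k = len(groups)
    n_f = [sum(f for _, f in g) for g in groups]                   # n_{i,f}
    means = [sum(f * y for y, f in g) / nf for g, nf in zip(groups, n_f)]
    nu = sum(n_f) - k                                              # nu = N - k
    s2 = sum(f * (y - m) ** 2
             for g, m in zip(groups, means) for y, f in g) / nu    # Equation (30)
    # Diagonal of Equation (29); requires nu > 2.
    variances = [nu / (nu - 2) * s2 / nf for nf in n_f]
    return means, s2, nu, variances
```

The credible interval in the text additionally needs a t-distribution quantile (IDF.T), which is not part of the Python standard library and is therefore omitted here.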


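We may also find a 100(1 - c)% equal-tailed Bayesian credible interval covering $\mu_i$ such that

$$\mu_i \in \left(\hat{\mu}_i - \text{IDF.T}\left(1 - \tfrac{c}{2}, \nu\right)\sqrt{\frac{s^2}{n_{i,f}}},\ \ \hat{\mu}_i + \text{IDF.T}\left(1 - \tfrac{c}{2}, \nu\right)\sqrt{\frac{s^2}{n_{i,f}}}\right) \tag{31}$$

with probability 1 - c, where c = 0.05 by default, i = 1, 2, ..., k, and $\hat{\mu}_i$ is the ith element in $\hat{\beta}$.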


Variance of error terms $\sigma^2$: Under the prior setting of (27), the marginal posterior distribution of $\sigma^2$
is

$$\sigma^2 \mid X, y \sim \text{Inverse-}\chi^2(\nu, s^2), \tag{32}$$

where $\nu = N - k$, and $s^2$ is defined by Equation (30).
We may find the mode and posterior estimators of $\sigma^2$, when $\nu > 4$, by computing the expected value

$$\hat{\sigma}^2 = E(\sigma^2 \mid X, y) = \frac{\nu}{\nu - 2}\,s^2, \tag{33}$$

and the variance of the marginal posterior distribution of $\sigma^2 \mid X, y$

$$C(\sigma^2 \mid X, y) = \frac{2\nu^2}{(\nu - 2)^2(\nu - 4)}\,s^4. \tag{34}$$

We may also find a 100(1 - c)% equal-tailed Bayesian credible interval covering $\sigma^2$ such that

$$\sigma^2 \in \left(\text{IDF.GAMMA}^{-1}\left(1 - \tfrac{c}{2}, \tfrac{\nu}{2}, \tfrac{\nu s^2}{2}\right),\ \ \text{IDF.GAMMA}^{-1}\left(\tfrac{c}{2}, \tfrac{\nu}{2}, \tfrac{\nu s^2}{2}\right)\right) \tag{35}$$

with probability 1 - c, where c = 0.05 by default.

Bayesian Inference for Pearson Correlation


Basic Statistics and Quantities in Estimating Sample Correlation Coefficient
Notations
The following notations defined in this section will be used in the subsequent sections.

N: Number of cases.

$x_i$: Observed value of the scale variable $X = (X_1, X_2, \ldots, X_N)$ for the i-th case.

$y_i$: Observed value of the scale variable $Y = (Y_1, Y_2, \ldots, Y_N)$ for the i-th case.

$w_i$: Weight for the i-th case. Non-integer frequency weights are rounded to the nearest integer. Cases
whose weights are less than 0.5 or missing are not used.

$W_x$: Sum of weights of cases used in computation of statistics for variable X. $W_x = N$ if no weights are
present.

$W_y$: Sum of weights of cases used in computation of statistics for variable Y. $W_y = N$ if no weights are
present.

$W_{xy}$: Sum of weights of cases used in computation of statistics for variables X and Y. $W_{xy} = N$ if no
weights are present.


Basic Statistics and Quantities 


Suppose there is a set of N ordered pairs of observations. We assume that the pairs are independent of
each other, while the observations within the same pair, $x_i$ and $y_i$, may be correlated. To estimate the sample
correlation coefficient r, we compute the following statistics.

Estimated sample mean: $\bar{x} = \dfrac{1}{W_x}\sum_{i=1}^{N} w_i x_i$. (1)

Estimated sample mean: $\bar{y} = \dfrac{1}{W_y}\sum_{i=1}^{N} w_i y_i$. (2)

Estimated sample variance: $s_x^2 = \dfrac{1}{W_x - 1}\left[\sum_{i=1}^{N} w_i x_i^2 - \left(\sum_{i=1}^{N} w_i x_i\right)^2 / W_x\right]$. (3)

Estimated sample variance: $s_y^2 = \dfrac{1}{W_y - 1}\left[\sum_{i=1}^{N} w_i y_i^2 - \left(\sum_{i=1}^{N} w_i y_i\right)^2 / W_y\right]$. (4)

The estimated cross-product deviation for variables X and Y is

$$C_{xy} = \sum_{i=1}^{N} w_i x_i y_i - \left(\sum_{i=1}^{N} w_i x_i\right)\left(\sum_{i=1}^{N} w_i y_i\right) / W_{xy}. \tag{5}$$

The estimated covariance is thus

$$C(X, Y) = \frac{C_{xy}}{W_{xy} - 1}, \tag{6}$$

and the estimated Pearson correlation coefficient is

$$r_{xy} = \frac{C_{xy}}{(W_{xy} - 1)\,s_x s_y}. \tag{7}$$

It is also convenient to define

$$S_{XX} = \sum_{i=1}^{N} w_i(x_i - \bar{x})^2, \quad S_{YY} = \sum_{i=1}^{N} w_i(y_i - \bar{y})^2, \quad \text{and} \quad S_{XY} = \sum_{i=1}^{N} w_i(x_i - \bar{x})(y_i - \bar{y}). \tag{8}$$

Hence, the estimated sample correlation coefficient is

$$r_{xy} = \frac{S_{XY}}{\sqrt{S_{XX}S_{YY}}}. \tag{9}$$
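Equations (8) and (9) translate into a few lines of code. The sketch below assumes that both variables are observed for every case, so that a single weight vector applies; the function name is hypothetical.

```python
import math

def weighted_pearson(x, y, w=None):
    """Weighted Pearson correlation r = S_XY / sqrt(S_XX * S_YY)."""
    w = w if w is not None else [1] * len(x)
    total = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / total
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / total
    s_xx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    s_yy = sum(wi * (yi - ybar) ** 2 for wi, yi in zip(w, y))
    s_xy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    return s_xy / math.sqrt(s_xx * s_yy)
```

The function raises `ZeroDivisionError` when either variable is constant, which corresponds to an undefined correlation.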


Bayesian Inference on the Correlation Coefficient 
Using Bayes Factor 
Bayes Factor Based on the JZS Prior 


The Bayes factor suggested by [Wetzels and Wagenmakers, 2012] under the JZS prior is

$$\Delta_{10} = \frac{\sqrt{W_{xy}/2}}{\Gamma(1/2)}\int_0^\infty (1+g)^{(W_{xy}-2)/2}\left[1 + (1-r^2)g\right]^{-(W_{xy}-1)/2} g^{-3/2}e^{-W_{xy}/(2g)}\,dg, \tag{10}$$

where $\Gamma(1/2) = \sqrt{\pi}$, and r ($|r| \ne 1$) is the sample correlation coefficient, which can be estimated by either
Equation (7) or Equation (9). Therefore, the Bayes factor in favor of the null hypothesis is $\Delta_{01} = 1/\Delta_{10}$,
with $\Delta_{10}$ defined by Equation (10). If the two variables have a perfect linear correlation, that is $|r| = 1$,
the integral in Equation (10) does not converge. In this scenario, we do not estimate the Bayes factor based
on the JZS prior. Note that the sufficient sample size to estimate the Bayes factor is $W_{xy} > 2$.


Fractional Bayes Factor 


The Bayes factor suggested by [Kang et al., 2001] is


K(a,y) I2(x,y;6) 
he = ri 
i Toa nea ay) 


where 
h(2,y) = fa — pe) Wey DPy-t yen + Vi? — 2rpo| Wa (12) 
oe [ [a _ p2) Wey-3)/2y-1 ae 4 Vie 2rp| —(Way—1) dVap, (13) 
Ha(enid) = f= 08) DV [YH + V9 arp] OT AV, (14) 
Ip(x,y3b) = [. fa — p?)OWay-3)/27—1 ly +Vi? — 2rp| ee i, (15) 


and r is defined by Equation (7). Note that the fraction b ∈ (0,1) is preset and specified by the user.
As in the previous section, when the two variables have a perfect linear
correlation, that is $|r| = 1$, we do not estimate the fractional Bayes factor.


Characterizing Posterior Distributions 


The sufficient sample size to estimate the posterior distribution is $W_{xy} > 2$. Suppose $E(X) = \lambda$, $E(Y) = \mu$,
$V(X) = \phi$, and $V(Y) = \psi$. We assume and place standard reference priors on $\lambda$, $\mu$, $\phi$, and $\psi$. To derive
the posterior density of p, we use the following substitution and approximation discussed in 
by noting that 


$$\Pr(\rho \mid X, Y) \propto p(\rho)\,\frac{(1-\rho^2)^{(W_{xy}-1)/2}}{(1-\rho r)^{W_{xy}-3/2}}, \tag{16}$$

where $p(\rho)$ is the prior density placed on $\rho$. A common choice of prior has the form

$$p(\rho) \propto (1-\rho^2)^c, \tag{17}$$

where c = 0 and c = -3/2 are two popular choices. Theoretically, users are allowed to specify any arbitrary
$c \in (-\infty, +\infty)$.

After making the hyperbolic tangent transformation

$$\rho = \tanh(\xi) \quad \text{and} \quad r = \tanh(z), \tag{18}$$

where $\tanh(z) = \sinh(z)/\cosh(z) = (e^z - e^{-z})/(e^z + e^{-z})$ and $|r| \ne 1$, we will finally have

$$\xi \sim \text{Normal}(z, 1/W_{xy}) \quad \text{for large } W_{xy}. \tag{19}$$

We also suggest

$$\xi \sim \text{Normal}\left(z - \frac{5r}{2W_{xy}},\ \frac{1}{W_{xy} - 3/2 + (5/2)(1-r^2)}\right), \tag{20}$$

which is a slightly better approximation when a uniform prior is placed on $\rho$. In practice, we can stick with
Equation (20).

To find the Bayes estimators, we can simulate $\xi$ based on Equation (19) or (20), and then transform to
$\rho$ by using $\rho = \tanh(\xi)$. Define

$$\rho^* = (\rho^*_1, \rho^*_2, \ldots, \rho^*_I), \tag{21}$$

where I ($I = 10^4$ by default) is a large integer input from syntax, denoting the number of posterior samples that we
finally collect. We may find the Bayes estimators of $\rho$ by computing the mode

$$\tilde{\rho} = \arg\max_{\rho}\ \{\Pr(\rho \mid X, Y)\}, \tag{22}$$

the expected value

$$E(\rho \mid X, Y) = \int \rho \Pr(\rho \mid X, Y)\,d\rho \approx \frac{1}{I}\sum_{i=1}^{I}\rho^*_i, \tag{23}$$

and the variance of the marginal posterior distribution

$$V(\rho \mid X, Y) = \int \rho^2 \Pr(\rho \mid X, Y)\,d\rho - [E(\rho \mid X, Y)]^2 \approx \frac{1}{I}\sum_{i=1}^{I}(\rho^*_i)^2 - [E(\rho \mid X, Y)]^2. \tag{24}$$


We can compare the estimated E(p|X,Y) and the null value to see whether there is a significant difference 
between them. We may also use V(p|X, Y) to evaluate the precision of the expected value we have computed 
from the posterior distribution. 

To obtain a more comprehensive summary, we can construct a 100(1 - α)% highest density
region (HDR), which is the smallest interval with a mass of 100(1 - α)% of the distribution. By definition,
that is to find a Bayesian credible interval satisfying

$$\int_{L_{\alpha/2}}^{H_{\alpha/2}} \Pr(\rho \mid X, Y)\,d\rho = 1 - \alpha, \tag{25}$$

where the length of $(L_{\alpha/2}, H_{\alpha/2})$ is the shortest among all the candidate pairs.
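The simulation-based estimators can be sketched as follows, using the large-sample approximation $\xi \sim \text{Normal}(z, 1/W_{xy})$. Function names, the fixed seed, and the sample-based HDR search (the shortest window containing the required share of sorted draws) are assumptions made for the illustration.

```python
import math
import random

def simulate_rho(r, w_xy, draws=10_000, seed=0):
    """Draw posterior samples of rho via xi ~ Normal(atanh(r), 1/W_xy), rho = tanh(xi)."""
    rng = random.Random(seed)
    z = math.atanh(r)
    sd = 1 / math.sqrt(w_xy)
    return [math.tanh(rng.gauss(z, sd)) for _ in range(draws)]

def summarize(samples, alpha=0.05):
    """Monte Carlo mean, variance, and shortest interval holding 100(1 - alpha)% of draws."""
    count = len(samples)
    mean = sum(samples) / count
    var = sum(s * s for s in samples) / count - mean ** 2
    ordered = sorted(samples)
    keep = math.ceil((1 - alpha) * count)
    # Slide a window of `keep` consecutive order statistics and take the narrowest one.
    start = min(range(count - keep + 1),
                key=lambda s: ordered[s + keep - 1] - ordered[s])
    return mean, var, (ordered[start], ordered[start + keep - 1])
```

A call such as `summarize(simulate_rho(0.5, 100))` returns the approximate posterior mean, variance, and HDR for a sample correlation of 0.5 with $W_{xy} = 100$.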
Two-Sample Bayesian Inference on Normal Distribution


Bayes-Factor Two-Sample Inference
Notations


The following notations defined in this section will be used in the subsequent sections.
k: Group index, k = 1, 2.
$x_{ki}$: Observed value of variable X for the i-th case in group k.

$w_{ki}$: Weight for the i-th case in group k. Non-integer frequency weights are rounded to the nearest
integer. Cases whose weights are less than 0.5 or missing are not used.

$N_k$: Number of cases in the data set for group k.
$W_k$: Sum of weights of cases in group k, $W_k = \sum_{i=1}^{N_k} w_{ki}$. $W_k = N_k$ if no weights are present.


Basic Statistics for Two-Sample Unpaired t-Test 


The Bayes factor for the one-sample t-test can be extended to a two-sample unpaired design. Correspondingly,
the following statistics are computed in a conventional way.

Sample mean $\bar{x}_k = \dfrac{1}{W_k}\sum_{i=1}^{N_k} w_{ki}x_{ki}$. (1)

Group mean difference $d = \bar{x}_2 - \bar{x}_1$. (2)

Sample variance $s_k^2 = \dfrac{1}{W_k - 1}\left[\sum_{i=1}^{N_k} w_{ki}x_{ki}^2 - \left(\sum_{i=1}^{N_k} w_{ki}x_{ki}\right)^2 / W_k\right]$. (3)

Sample standard deviation $s_k = \sqrt{s_k^2}$. (4)

Pooled Test Statistics

Pooled variance of the mean difference $s_p^2 = \dfrac{(W_1 - 1)s_1^2 + (W_2 - 1)s_2^2}{W_1 + W_2 - 2}$. (5)

Pooled standard deviation of the mean difference $s_p = \sqrt{s_p^2}$. (6)

Pooled standard error of the difference $s_d = s_p\sqrt{\dfrac{1}{W_1} + \dfrac{1}{W_2}}$. (7)

Observed t-statistic with pooled variance $t = \dfrac{d}{s_d}$, with $(W_1 + W_2 - 2)$ degrees of freedom. (8)

Pooled significance (2-tailed) Sig. (2-tailed) $= 2\left[1 - \text{CdfT}(|t|, W_1 + W_2 - 2)\right]$. (9)
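Equations (1) to (8) can be sketched as a single function. The name `pooled_t` and the optional weight arguments are illustrative; the two-tailed significance in Equation (9) needs a t-distribution CDF, which is left to a statistics library.

```python
import math

def pooled_t(x1, x2, w1=None, w2=None):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    def weighted_stats(x, w):
        w = w if w is not None else [1] * len(x)
        total = sum(w)                                        # W_k
        sw = sum(wi * xi for wi, xi in zip(w, x))
        mean = sw / total                                     # Equation (1)
        var = (sum(wi * xi * xi for wi, xi in zip(w, x)) - sw * sw / total) / (total - 1)
        return total, mean, var                               # Equation (3)

    W1, m1, v1 = weighted_stats(x1, w1)
    W2, m2, v2 = weighted_stats(x2, w2)
    d = m2 - m1                                               # Equation (2)
    sp2 = ((W1 - 1) * v1 + (W2 - 1) * v2) / (W1 + W2 - 2)     # Equation (5)
    sd = math.sqrt(sp2) * math.sqrt(1 / W1 + 1 / W2)          # Equation (7)
    return d / sd, W1 + W2 - 2                                # Equation (8)
```

The returned statistic is the t used by the Bayes factor formulas in the following sections.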
Unpooled Test Statistics

Unpooled standard error of the difference $s_d' = \sqrt{\dfrac{s_1^2}{W_1} + \dfrac{s_2^2}{W_2}}$. (10)

Observed t-statistic with unpooled variance $t' = \dfrac{d}{s_d'}$, with $1/(Z_1 + Z_2)$ degrees of freedom, (11)

where

$$Z_k = \left(\frac{s_k^2/W_k}{s_1^2/W_1 + s_2^2/W_2}\right)^2 \Big/ (W_k - 1). \tag{12}$$

Unpooled significance (2-tailed) Sig. (2-tailed) $= 2\left[1 - \text{CdfT}(|t'|, 1/(Z_1 + Z_2))\right]$. (13)
Rouder’s Method 
The Bayes factor for the two-sample unpaired t-test under Rouder’s method is

$$BF_{01} = \frac{\left(1 + \dfrac{t^2}{\nu}\right)^{-(\nu+1)/2}}{\displaystyle\int_0^\infty (1+Ng)^{-1/2}\left(1 + \frac{t^2}{(1+Ng)\nu}\right)^{-(\nu+1)/2}(2\pi)^{-1/2}g^{-3/2}e^{-1/(2g)}\,dg}, \tag{14}$$

where t is the pooled-variance two-sample t-statistic defined by Equation (8); $N = W_1W_2/(W_1 + W_2)$;
$\nu = W_1 + W_2 - 2$; and g is the variable to be integrated out.


Gonen’s Method 


The Bayes factor for the two-sample unpaired t-test under Gonen’s method is

$$BF_{01} = \frac{T_\nu(t \mid 0, 1)}{T_\nu\left(t \mid \sqrt{N}\lambda,\ 1 + N\sigma_\delta^2\right)} = \frac{\text{PDF.T}(t, \nu)\,\sqrt{1 + N\sigma_\delta^2}}{\text{NPDF.T}\left(\dfrac{t}{\sqrt{1 + N\sigma_\delta^2}},\ \nu,\ \dfrac{\sqrt{N}\lambda}{\sqrt{1 + N\sigma_\delta^2}}\right)}, \tag{15}$$

where t is the pooled-variance two-sample t-statistic defined by Equation (8); $\nu = W_1 + W_2 - 2$; $N =
W_1W_2/(W_1 + W_2)$; $\lambda$ and $\sigma_\delta^2$ denote the prior mean and variance of $(\mu_1 - \mu_2)/\sigma$; $T_\nu(\cdot)$ denotes the noncentral
t probability density function; and PDF.T(·) and NPDF.T(·) are the IBM® SPSS® Statistics probability
density functions for the (noncentral) t distribution.

It is quite natural to assume that $\lambda = 0$. For the case where the prior mean of the effect size is assumed
to be zero, Equation (15) can be reduced to

$$BF_{01,\lambda=0} = \left(\frac{1 + t^2/\nu}{1 + t^2/\left(\nu(1 + N\sigma_\delta^2)\right)}\right)^{-(\nu+1)/2}\sqrt{1 + N\sigma_\delta^2}. \tag{16}$$

The sufficient sample size for estimation is $W_1, W_2 > 1$.
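The zero-prior-mean case in Equation (16) needs only elementary functions. The sketch below is an illustration with hypothetical names; `sigma2_delta` stands for the prior variance of the effect size.

```python
import math

def gonen_bf01(t, w1, w2, sigma2_delta):
    """Bayes factor in favor of H0 for lambda = 0 (Equation 16)."""
    nu = w1 + w2 - 2
    n = w1 * w2 / (w1 + w2)
    scale = 1 + n * sigma2_delta
    ratio = (1 + t * t / nu) / (1 + t * t / (nu * scale))
    return ratio ** (-(nu + 1) / 2) * math.sqrt(scale)
```

At t = 0 the expression reduces to $\sqrt{1 + N\sigma_\delta^2}$, which is greater than 1 and therefore supports the null hypothesis; large |t| drives the value toward 0.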


Hyper-Prior Method 


The Bayes factor for the two-sample unpaired t-test under the hyper prior of $\sigma_\delta^2$ is

$$BF_{10,\lambda=0} = \int_0^\infty \left(\frac{1 + t^2/\nu}{1 + t^2/\left(\nu(1 + N\sigma_\delta^2)\right)}\right)^{(\nu+1)/2}\left(1 + N\sigma_\delta^2\right)^{-1/2}\pi(\sigma_\delta^2)\,d\sigma_\delta^2. \tag{17}$$

Setting $\kappa = N$ and $b = \dfrac{\nu + 1}{2} - a - \dfrac{5}{2}$, Equation (17) can be reduced to a closed form

$$BF_{10,\lambda=0} = \frac{\Gamma(\nu/2)\,\Gamma(a + 3/2)}{\Gamma((\nu + 1)/2)\,\Gamma(a + 1)}\left(1 + \frac{t^2}{\nu}\right)^{(\nu - 2a - 2)/2}, \tag{18}$$

where $\Gamma(\cdot)$ denotes the Gamma function; t is the pooled-variance two-sample t-statistic defined by Equation
(8); $\nu = W_1 + W_2 - 2$; a is input by users (a = -0.75 is the default setting), and it is recommended that
a ∈ (-1, -1/2] [Wang and Liu, 2016]. Note that the Bayes factor estimated
by the hyper-prior method is reported in favor of $H_1$ versus $H_0$.
The sufficient sample size for estimation is $W_1, W_2 > 1$.
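The closed form in Equation (18) can be evaluated on the log scale with `math.lgamma`. This sketch follows the equation as reconstructed here, in particular the exponent $(\nu - 2a - 2)/2$, so treat it as an assumption to verify against the cited paper; the function name is hypothetical.

```python
import math

def hyper_prior_bf10(t, w1, w2, a=-0.75):
    """Closed-form Bayes factor in favor of H1 under the hyper prior (Equation 18)."""
    nu = w1 + w2 - 2
    log_bf = (math.lgamma(nu / 2) + math.lgamma(a + 1.5)
              - math.lgamma((nu + 1) / 2) - math.lgamma(a + 1)
              + 0.5 * (nu - 2 * a - 2) * math.log1p(t * t / nu))
    return math.exp(log_bf)
```

Working with log-gamma values keeps the ratio of Gamma functions stable when the degrees of freedom are large.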
Bayesian Two-Sample Inference By Estimating Posterior Distributions 


Bayesian Two-Sample Inference Using Conjugate and Noninformative Priors 
Notations 
The following notations defined in this section will be used in the subsequent sections.
X: A random variable to be tested whose values are observed for Group 1. We assume $X \sim \text{Normal}(\mu_x, \sigma_x^2)$.
Y: A random variable to be tested whose values are observed for Group 2. We assume $Y \sim \text{Normal}(\mu_y, \sigma_y^2)$.
$\mu_x$: Mean parameter of X.
$\mu_y$: Mean parameter of Y.
$d_\mu$: Mean parameter of Y - X, where $d_\mu = \mu_y - \mu_x$.
$\sigma_x$: Standard deviation parameter of X.
$\sigma_y$: Standard deviation parameter of Y.
$N_x$: Number of cases in the data set for group X.

$N_y$: Number of cases in the data set for group Y.

$w_{xj}$: Weight for the j-th case in X. Non-integer frequency weights are rounded to the nearest integer.
Cases whose weights are less than 0.5 or missing are not used.

$w_{yi}$: Weight for the i-th case in Y. Non-integer frequency weights are rounded to the nearest integer.
Cases whose weights are less than 0.5 or missing are not used.

$W_x$: Sum of weights of cases, $W_x = \sum_{j=1}^{N_x} w_{xj}$. $W_x = N_x$ if no weights are present.

$W_y$: Sum of weights of cases, $W_y = \sum_{i=1}^{N_y} w_{yi}$. $W_y = N_y$ if no weights are present.


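Diffuse Priors with Known Variances

In this section, we assume that both $\sigma_x^2$ and $\sigma_y^2$ are known, and place independent diffuse priors by noting
that $p(\mu_x \mid \sigma_x^2) \propto 1$ and $p(\mu_y \mid \sigma_y^2) \propto 1$.
Under this setting, we are interested in drawing inference on $d_\mu$. The marginal posterior distribution
of $d_\mu$ is

$$d_\mu \mid (X, Y) \sim \text{Normal}(\mu_n, \sigma_n^2), \tag{19}$$

where

$$\mu_n = \bar{y} - \bar{x} = \frac{1}{W_y}\sum_{i=1}^{N_y} w_{yi}y_i - \frac{1}{W_x}\sum_{j=1}^{N_x} w_{xj}x_j \quad \text{and} \quad \sigma_n^2 = \frac{\sigma_x^2}{W_x} + \frac{\sigma_y^2}{W_y}. \tag{20}$$

We may find the Bayes estimators of $d_\mu$ by computing the mode

$$\hat{d}_\mu = \bar{y} - \bar{x}, \tag{21}$$

the expected value

$$E[d_\mu \mid (X, Y)] = \bar{y} - \bar{x}, \tag{22}$$

and the variance of the marginal posterior distribution of $d_\mu \mid (X, Y)$

$$V[d_\mu \mid (X, Y)] = \sigma_n^2. \tag{23}$$

We may also find a 100(1 - c)% equal-tailed Bayesian credible interval covering $d_\mu$ such that

$$d_\mu \in \left(\text{IdfNorm}\left(\tfrac{c}{2}, \mu_n, \sqrt{\sigma_n^2}\right),\ \ \text{IdfNorm}\left(1 - \tfrac{c}{2}, \mu_n, \sqrt{\sigma_n^2}\right)\right) \tag{24}$$

with probability 1 - c, where c = 0.05 by default.
The sufficient sample size for estimation is $W_x, W_y > 0$.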
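With known variances the posterior is normal, so the estimators and the credible interval follow directly. The sketch below assumes unweighted data (all weights equal to 1) and uses `statistics.NormalDist` from the Python standard library in place of IdfNorm; the function name is hypothetical.

```python
from statistics import NormalDist

def diffuse_known_variance_posterior(x, y, var_x, var_y, c=0.05):
    """Posterior mean, variance, and equal-tailed interval for d = mu_y - mu_x."""
    mu_n = sum(y) / len(y) - sum(x) / len(x)          # Equation (20), unweighted case
    var_n = var_x / len(x) + var_y / len(y)
    posterior = NormalDist(mu_n, var_n ** 0.5)
    interval = (posterior.inv_cdf(c / 2), posterior.inv_cdf(1 - c / 2))
    return mu_n, var_n, interval
```

The interval is symmetric around the posterior mean because the normal distribution is symmetric.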


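Normal Priors with Known Variances

In this section, we assume that both $\sigma_x^2$ and $\sigma_y^2$ are known, and place the independent normal priors $\mu_x \sim
\text{Normal}(\mu_{x_0}, \sigma_{x_0}^2)$ and $\mu_y \sim \text{Normal}(\mu_{y_0}, \sigma_{y_0}^2)$.
Under this setting, we are interested in drawing inference on $d_\mu$. The marginal posterior distribution
of $d_\mu$ is

$$d_\mu \mid (X, Y) \sim \text{Normal}\left(\mu_{y_n} - \mu_{x_n},\ \sigma_{y_n}^2 + \sigma_{x_n}^2\right), \tag{25}$$

where

$$\sigma_{y_n}^2 = \left(\frac{1}{\sigma_{y_0}^2} + \frac{W_y}{\sigma_y^2}\right)^{-1}, \quad \mu_{y_n} = \sigma_{y_n}^2\left(\frac{\mu_{y_0}}{\sigma_{y_0}^2} + \frac{\bar{y}\,W_y}{\sigma_y^2}\right), \tag{26}$$

and

$$\sigma_{x_n}^2 = \left(\frac{1}{\sigma_{x_0}^2} + \frac{W_x}{\sigma_x^2}\right)^{-1}, \quad \mu_{x_n} = \sigma_{x_n}^2\left(\frac{\mu_{x_0}}{\sigma_{x_0}^2} + \frac{\bar{x}\,W_x}{\sigma_x^2}\right). \tag{27}$$

The computation of $\bar{x}$ and $\bar{y}$ is the same as in Equation (20). We may find the Bayes estimators of $d_\mu$ by
computing the mode

$$\hat{d}_\mu = \mu_{y_n} - \mu_{x_n}, \tag{28}$$

the expected value

$$E(d_\mu \mid X, Y) = \mu_{y_n} - \mu_{x_n}, \tag{29}$$

and the variance of the marginal posterior distribution of $d_\mu \mid (X, Y)$

$$V(d_\mu \mid X, Y) = \sigma_{y_n}^2 + \sigma_{x_n}^2. \tag{30}$$

We may also find a 100(1 - c)% equal-tailed Bayesian credible interval covering $d_\mu$ such that

$$d_\mu \in \left(\text{IdfNorm}\left(\tfrac{c}{2},\ \mu_{y_n} - \mu_{x_n},\ \sqrt{\sigma_{y_n}^2 + \sigma_{x_n}^2}\right),\ \ \text{IdfNorm}\left(1 - \tfrac{c}{2},\ \mu_{y_n} - \mu_{x_n},\ \sqrt{\sigma_{y_n}^2 + \sigma_{x_n}^2}\right)\right) \tag{31}$$

with probability 1 - c, where c = 0.05 by default.
The sufficient sample size for estimation is $W_x, W_y > 0$.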


Diffuse-Jeffreys Priors with Equal Variances 


In this section, we assume that $\sigma_x^2 = \sigma_y^2 = \sigma^2$, and $p(\sigma^2) \propto 1/\sigma^2$. We place the diffuse priors by noting that
$p(\mu_x \mid \sigma^2) \propto 1$ and $p(\mu_y \mid \sigma^2) \propto 1$.

Under this setting, we are interested in drawing inference on $d_\mu$. The marginal posterior distribution
of $d_\mu$ is

$$d_\mu \mid (X, Y) \sim t_{\nu_n}(\mu_n, \sigma_n^2), \tag{32}$$

where

$$\nu_n = W_x + W_y - 2, \tag{33}$$

BAYES INDEPENDENT Algorithms 


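$$\mu_n = \bar{y} - \bar{x} = \frac{1}{W_y}\sum_{i=1}^{N_y} w_{yi}y_i - \frac{1}{W_x}\sum_{j=1}^{N_x} w_{xj}x_j, \tag{34}$$

and

$$\sigma_n^2 = \frac{1}{\nu_n}\left[\sum_{i=1}^{N_y} w_{yi}(y_i - \bar{y})^2 + \sum_{j=1}^{N_x} w_{xj}(x_j - \bar{x})^2\right]\left(\frac{1}{W_y} + \frac{1}{W_x}\right). \tag{35}$$

We may find the Bayes estimators of $d_\mu$ by computing the mode

$$\hat{d}_\mu = \bar{y} - \bar{x}, \tag{36}$$

the expected value

$$E[d_\mu \mid (X, Y)] = \bar{y} - \bar{x}, \tag{37}$$

and the variance of the marginal posterior distribution of $d_\mu \mid (X, Y)$

$$V[d_\mu \mid (X, Y)] = \frac{\nu_n}{\nu_n - 2}\,\sigma_n^2. \tag{38}$$

We may also find a 100(1 - c)% equal-tailed Bayesian credible interval covering $d_\mu$ such that

$$d_\mu \in \left(\mu_n - \text{IdfT}\left(1 - \tfrac{c}{2}, \nu_n\right)\sqrt{\sigma_n^2},\ \ \mu_n + \text{IdfT}\left(1 - \tfrac{c}{2}, \nu_n\right)\sqrt{\sigma_n^2}\right) \tag{39}$$

with probability 1 - c, where c = 0.05 by default.
The sufficient sample size for estimation is $W_x, W_y > 1$.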


Diffuse-Inverse Chi-Square Priors with Equal Variances 


In this section, we assume that 02 = 04 = 07 and o? ~ Inverse-x?(V, 95), and rewrite X ~ Normal(z,,,,07) 


and Y ~ Normal(z,,, + d,,07). We place the diffuse priors by noting that p(z,,|o7) « 1 and p(z,,|07) « 1. 
Under this setting, we are interested in drawing inference on d,,. Thus, the marginal posterior distribution 
of d,, is 


dy |\(X, ae) tu, (Uris on) ) (40) 
where 
1 Ny 1 Naz 
Un =V9 +Wez + Wy, - 2, Hn =9-E= 7 Wyidi — FD, Wass (41) 
¥ j=1 ® j=1 
and 
1 vy Ns ir be 
te i=l j=l y * 


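We may find the Bayes estimators of $d_\mu$ by computing the mode


$$ \hat{d}_\mu = \bar{y} - \bar{x}, \tag{43} $$
the expected value
$$ E[d_\mu \mid (X, Y)] = \bar{y} - \bar{x}, \tag{44} $$
and the variance of the marginal posterior distribution of $d_\mu \mid (X, Y)$
$$ V[d_\mu \mid (X, Y)] = \frac{\nu_n}{\nu_n - 2}\, \sigma_n^2. \tag{45} $$


We may also find a 100(1 − c)% equal-tailed Bayesian credible interval covering $d_\mu$ such that
$$ d_\mu \in \left( \mu_n - \mathrm{IdfT}\!\left(1 - \tfrac{c}{2}, \nu_n\right) \sqrt{\sigma_n^2},\ \ \mu_n + \mathrm{IdfT}\!\left(1 - \tfrac{c}{2}, \nu_n\right) \sqrt{\sigma_n^2} \right) \tag{46} $$
with probability 1 − c, where c = 0.05 by default.
The sufficient sample size to estimate is $W_x, W_y > 1$.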

Diffuse Priors with Unequal Variances


In this section, we do not make any assumptions on the equality of $\sigma_x^2$ and $\sigma_y^2$. We place the diffuse priors
on all of the parameters by noting that $p(\mu_x, \mu_y, \sigma_x^2, \sigma_y^2) \propto 1$.


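Define
$$ \psi = \arctan\left( \frac{s_y / \sqrt{W_y}}{s_x / \sqrt{W_x}} \right), \tag{47} $$
where
$$ s_y = \left[ \sum_{i=1}^{N_y} w_{yi} (y_i - \bar{y})^2 / (W_y - 1) \right]^{1/2} \quad \text{and} \quad s_x = \left[ \sum_{j=1}^{N_x} w_{xj} (x_j - \bar{x})^2 / (W_x - 1) \right]^{1/2}. \tag{48} $$
Note that
$$ T = \frac{d_\mu - (\bar{y} - \bar{x})}{\sqrt{s_x^2 / W_x + s_y^2 / W_y}} = T_y \sin\psi - T_x \cos\psi, \tag{49} $$
where
$$ T_y = \frac{\mu_y - \bar{y}}{s_y / \sqrt{W_y}}, \qquad T_x = \frac{\mu_x - \bar{x}}{s_x / \sqrt{W_x}}, \tag{50} $$
and $\psi$ is defined by Equation (47). Hence,
$$ T \sim \text{Behrens-Fisher}(W_y - 1, W_x - 1, \psi). \tag{51} $$


In practice, we approximate the distribution in Equation (51) according to the discussion in [Patil, 1965] by finding


$$ n_1 = \frac{W_y - 1}{W_y - 3} \sin^2\psi + \frac{W_x - 1}{W_x - 3} \cos^2\psi, $$

$$ n_2 = \frac{(W_y - 1)^2}{(W_y - 3)^2 (W_y - 5)} \sin^4\psi + \frac{(W_x - 1)^2}{(W_x - 3)^2 (W_x - 5)} \cos^4\psi, $$

$$ n_3 = 4 + n_1^2 / n_2, \quad \text{and} \quad n_4 = \sqrt{n_1 (n_3 - 2) / n_3}. \tag{52} $$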


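Under this setting, the marginal posterior distribution of $d_\mu$ is


$$ d_\mu \mid (X, Y) \sim t_{\nu_n}(\mu_n, \sigma_n^2), \tag{53} $$
where
$$ \nu_n = n_3, \qquad \mu_n = \bar{y} - \bar{x} = \frac{1}{W_y} \sum_{i=1}^{N_y} w_{yi}\, y_i - \frac{1}{W_x} \sum_{j=1}^{N_x} w_{xj}\, x_j, \qquad \text{and} \qquad \sigma_n^2 = n_4^2 \left( s_x^2 / W_x + s_y^2 / W_y \right). \tag{54} $$


We may find the Bayes estimators of $d_\mu$ by computing the mode


$$ \hat{d}_\mu = \bar{y} - \bar{x}, \tag{55} $$
the expected value
$$ E[d_\mu \mid (X, Y)] = \bar{y} - \bar{x}, \tag{56} $$
and the variance of the marginal posterior distribution of $d_\mu \mid (X, Y)$
$$ V[d_\mu \mid (X, Y)] = \frac{\nu_n}{\nu_n - 2}\, \sigma_n^2. \tag{57} $$


We may also find a 100(1 − c)% equal-tailed Bayesian credible interval covering $d_\mu$ such that
$$ d_\mu \in \left( \mu_n - \mathrm{IdfT}\!\left(1 - \tfrac{c}{2}, \nu_n\right) \sqrt{\sigma_n^2},\ \ \mu_n + \mathrm{IdfT}\!\left(1 - \tfrac{c}{2}, \nu_n\right) \sqrt{\sigma_n^2} \right) \tag{58} $$


with probability 1 − c, where c = 0.05 by default.
The sufficient sample size to estimate is $W_x, W_y > 5$.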
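As an illustration only, the moment-matching approximation of this section can be sketched as follows. The function name `patil_t_approximation` is hypothetical, and the sketch assumes the reading in which the y-sample terms are paired with the sine of the angle and the x-sample terms with the cosine. It returns the approximate degrees of freedom and squared scale of the t distribution for the mean difference.

```python
import math


def patil_t_approximation(sx2, sy2, Wx, Wy):
    """Approximate a Behrens-Fisher posterior by a scaled t distribution.

    sx2, sy2: sample variances; Wx, Wy: effective sample sizes (each > 5).
    Returns (nu_n, sigma2_n).
    """
    if Wx <= 5 or Wy <= 5:
        raise ValueError("both effective sample sizes must exceed 5")
    # Angle between the two standard errors (y over x).
    psi = math.atan2(math.sqrt(sy2 / Wy), math.sqrt(sx2 / Wx))
    s2, c2 = math.sin(psi) ** 2, math.cos(psi) ** 2
    n1 = (Wy - 1) / (Wy - 3) * s2 + (Wx - 1) / (Wx - 3) * c2
    n2 = ((Wy - 1) ** 2 / ((Wy - 3) ** 2 * (Wy - 5)) * s2 ** 2
          + (Wx - 1) ** 2 / ((Wx - 3) ** 2 * (Wx - 5)) * c2 ** 2)
    n3 = 4 + n1 ** 2 / n2                   # approximate degrees of freedom
    n4 = math.sqrt(n1 * (n3 - 2) / n3)      # scale multiplier
    sigma2_n = n4 ** 2 * (sx2 / Wx + sy2 / Wy)
    return n3, sigma2_n


nu_n, sigma2_n = patil_t_approximation(1.0, 1.0, 11, 11)
```

In the balanced usage case both samples contribute equally, and the approximate degrees of freedom come out as 16.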

Bayesian Inference for the Independence of Two Factors


General Notations 


We desire to test the null hypothesis $H_0$: No association between rows and columns versus $H_1$: They are
associated. The following notations defined in this section will be used for the subsequent sections.


$r$: $r = 1, 2, \ldots, R$ denoting the non-empty row index, where $R \ge 2$, and $R$ is an integer.
$s$: $s = 1, 2, \ldots, S$ denoting the non-empty column index, where $S \ge 2$, and $S$ is an integer.


$y_{**}$: A matrix containing all of the observed cell counts with

$$ Y = \begin{pmatrix} y_{11} & y_{12} & \cdots & y_{1S} \\ y_{21} & y_{22} & \cdots & y_{2S} \\ \vdots & \vdots & \ddots & \vdots \\ y_{R1} & y_{R2} & \cdots & y_{RS} \end{pmatrix}, \tag{1} $$
where $y_{rs}$ must be a nonnegative integer.
$\mathbf{y}$: $\mathbf{y} = (y_{11}, y_{12}, \ldots, y_{RS})^T$, a vectorized $y_{**}$ containing all of the observed cell counts.


$y_{rs}$: Observed count data in the cell on the $r$-th row and the $s$-th column of the contingency table. Note
that $y_{rs} \ge 0$, and $y_{rs}$ is an integer.


$y_{r\cdot}$: $y_{r\cdot} = \sum_{s=1}^{S} y_{rs}$, the marginal total of the $r$-th row.
$y_{\cdot s}$: $y_{\cdot s} = \sum_{r=1}^{R} y_{rs}$, the marginal total of the $s$-th column.
$Y$: $Y = \sum_{r=1}^{R} \sum_{s=1}^{S} y_{rs}$, the total count of the cells.


$\hat{y}_{rs}$: Expected count in the cell on the $r$-th row and the $s$-th column of the contingency table. $\hat{y}_{rs} = y_{r\cdot}\, y_{\cdot s} / Y$.
$y_{\cdot *}$: $y_{\cdot *} = (y_{\cdot 1}, y_{\cdot 2}, \ldots, y_{\cdot S})^T$, a vector containing marginal column sums, where $S \ge 2$.
$y_{* \cdot}$: $y_{* \cdot} = (y_{1 \cdot}, y_{2 \cdot}, \ldots, y_{R \cdot})^T$, a vector containing marginal row sums, where $R \ge 2$.
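The marginal totals and expected counts defined above can be computed directly from a table of observed counts. The following is a minimal sketch; the function name `margins_and_expected` and the list-of-rows table layout are assumptions made for illustration.

```python
def margins_and_expected(table):
    """Return row sums, column sums, grand total and expected counts.

    table is a list of rows, each a list of nonnegative integer counts.
    """
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / total for c in col_sums] for r in row_sums]
    return row_sums, col_sums, total, expected


rows, cols, total, expected = margins_and_expected([[10, 20], [30, 40]])
```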


Bayesian Inference by Using Bayes Factors 


To implement the following methods, we require that both factors have at least two categories ($\ge 2$) so that
a valid two-way contingency table can be formed. Otherwise, we give a warning message, and do not conduct
any further Bayesian analyses.


Bayes Factors Based on Natural Conjugate Priors 


[Gunel and Dickey, 1974] proposed a unified approach when considering the association between two factors
in a contingency table under the different model settings. The general idea is to assume conjugate gamma
priors for Poisson models, and then extend to the other further conditioned models.
We let $a_{**}$ denote a matrix of prior shape parameters with the same dimension as $y_{**}$


$$ a_{**} = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1S} \\ a_{21} & a_{22} & \cdots & a_{2S} \\ \vdots & \vdots & \ddots & \vdots \\ a_{R1} & a_{R2} & \cdots & a_{RS} \end{pmatrix}, \tag{2} $$


where $a_{rs} > 0$. $a_{rs} = 1$ is the setting by default. Users can overwrite this setting by specifying different
values, the number of which must match that of $y_{**}$. We further define the following notations:
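$\mathbf{a}$: $\mathbf{a} = (a_{11}, a_{12}, \ldots, a_{RS})^T$, a vectorized $a_{**}$ containing all of the prior shape parameters.
$A$: $A = \sum_{r=1}^{R} \sum_{s=1}^{S} a_{rs}$, the total of the prior shape parameters.

$X$: $X = A - (R - 1)(S - 1)$.

$a_{\cdot *}$: $a_{\cdot *} = (a_{\cdot 1}, a_{\cdot 2}, \ldots, a_{\cdot S})^T$, a vector containing marginal column sums, where $S \ge 2$.

$a_{* \cdot}$: $a_{* \cdot} = (a_{1 \cdot}, a_{2 \cdot}, \ldots, a_{R \cdot})^T$, a vector containing marginal row sums, where $R \ge 2$.

$\xi_{\cdot *}$: $\xi_{\cdot *} = a_{\cdot *} - (R - 1)$, that is, $(R - 1)$ is subtracted from each element of $a_{\cdot *}$.
$\xi_{* \cdot}$: $\xi_{* \cdot} = a_{* \cdot} - (S - 1)$, that is, $(S - 1)$ is subtracted from each element of $a_{* \cdot}$.


We also define the multivariate Beta function

$$ B(\boldsymbol{\alpha}) = \frac{\prod_{k} \Gamma(\alpha_k)}{\Gamma\left( \sum_{k} \alpha_k \right)}, \tag{3} $$

where $\alpha_k > 0$.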


Independent Poisson Sampling Models


The Bayes factor for independence under the Poisson sampling models is 


1 (R-1)(S—1) = + X) ars Bly. + Ex.) B(y.« + Es) 
BFo = (1+5) TM ts Be.) Bee)” 7 


where $R$ and $S$ are determined by the numbers of categories found in the data sample; $a_{rs}$ and $b$ are specified
by users. Note that $a_{rs} = 1$ and $b = R \times S \times \min(a_{rs}) / Y$ are the settings by default.


Joint Multinomial Sampling Models 


Under this sampling scheme, the total number of observations $Y$ is fixed. The cell counts are jointly multinomially
distributed, or $(y_{11}, y_{12}, \ldots, y_{RS}) \sim \text{Multinomial}(Y, \pi_{11}, \pi_{12}, \ldots, \pi_{RS})$, where $\sum_{r=1}^{R} \sum_{s=1}^{S} \pi_{rs} = 1$. The
prior distribution is the conjugate Dirichlet distribution $(\pi_{11}, \pi_{12}, \ldots, \pi_{RS}) \sim \text{Dirichlet}(a_{11}, a_{12}, \ldots, a_{RS})$.
The Bayes factor for independence under the joint Multinomial sampling models is


$$ BF_{01} = \frac{B(y_{* \cdot} + \xi_{* \cdot})\, B(y_{\cdot *} + \xi_{\cdot *})\, B(\mathbf{a})}{B(\xi_{* \cdot})\, B(\xi_{\cdot *})\, B(\mathbf{y} + \mathbf{a})}, \tag{5} $$


where $a_{rs}$ is specified by users. Note that $a_{rs} = 1$ is the setting by default.
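A minimal sketch of the joint Multinomial Bayes factor follows, working on the log scale to avoid overflow in the Gamma functions. It assumes the usual definition of the multivariate Beta function as a product of Gamma functions divided by the Gamma function of the sum. The function names `log_mvbeta` and `bf01_joint_multinomial` are hypothetical, and returning `None` when a shifted prior margin is not positive is an illustrative stand-in for a missing value.

```python
import math


def log_mvbeta(alpha):
    """Log of the multivariate Beta function of a vector of positive values."""
    return sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))


def bf01_joint_multinomial(y, a=None):
    """Bayes factor in favor of independence for an R x S table of counts y.

    a is the matrix of prior shape parameters (all ones by default).
    """
    R, S = len(y), len(y[0])
    if a is None:
        a = [[1.0] * S for _ in range(R)]
    y_row = [sum(r) for r in y]
    y_col = [sum(c) for c in zip(*y)]
    xi_row = [sum(r) - (S - 1) for r in a]        # row sums of a minus (S - 1)
    xi_col = [sum(c) - (R - 1) for c in zip(*a)]  # column sums of a minus (R - 1)
    if min(xi_row + xi_col) <= 0:
        return None                               # prior too small; treated as missing
    flat_y = [v for row in y for v in row]
    flat_a = [v for row in a for v in row]
    log_bf = (log_mvbeta([u + v for u, v in zip(y_row, xi_row)])
              + log_mvbeta([u + v for u, v in zip(y_col, xi_col)])
              + log_mvbeta(flat_a)
              - log_mvbeta(xi_row)
              - log_mvbeta(xi_col)
              - log_mvbeta([u + v for u, v in zip(flat_y, flat_a)]))
    return math.exp(log_bf)


bf_balanced = bf01_joint_multinomial([[25, 25], [25, 25]])
bf_diagonal = bf01_joint_multinomial([[50, 0], [0, 50]])
```

A perfectly balanced table gives a Bayes factor above 1 (evidence for independence), while a table with all counts on the diagonal gives a value very close to 0.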


Independent Multinomial Sampling Models 


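The Bayes factor for independence under the independent Multinomial sampling models when the row
margins are fixed is


$$ BF_{01} = \frac{B(y_{\cdot *} + \xi_{\cdot *})\, B(y_{* \cdot} + a_{* \cdot})\, B(\mathbf{a})}{B(\xi_{\cdot *})\, B(a_{* \cdot})\, B(\mathbf{y} + \mathbf{a})}, \tag{6} $$
where $a_{rs}$ is specified by users. Note that $a_{rs} = 1$ is the setting by default. Note that when the column
margins are fixed, Equation (6) changes to
$$ BF_{01} = \frac{B(y_{* \cdot} + \xi_{* \cdot})\, B(y_{\cdot *} + a_{\cdot *})\, B(\mathbf{a})}{B(\xi_{* \cdot})\, B(a_{\cdot *})\, B(\mathbf{y} + \mathbf{a})}. \tag{7} $$
Similarly, if the $\xi$ vector used contains any components $\le 0$, we set $BF_{01}$ to be missing, and give a warning message
indicating that “Bayes factor cannot appropriately be estimated because at least one component for the
prior is too small.”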

Bayes Factors Based on a Mixture of Symmetric Dirichlet Distributions


The Bayes factors presented in this section are based on the methods discussed by [Good, 1976]. The work
evaluated the independence in contingency tables by using a mixture of symmetric Dirichlet distributions.
In the following presentation, we define


n YL Pe rte) [Pim +k), (k\ ak 
Se Toa (F(R) bY + th) (7) ? 
=Tml — ®'(m,,t,t'), (8) 
where 
j= (9) 


k(n? + log? k)’ 


which is the specific log-Cauchy density function as a hyper prior suggested by [Good, 1976].
Under the null hypothesis $H_0$, the probability of the interior of a contingency table given the marginals,
denoted by $\mathrm{Pr}_y$, is


$$ \mathrm{Pr}_y = P(y_{**} \mid y_{r\cdot}, y_{\cdot s}, H_0) = \frac{\prod_{r=1}^{R} y_{r\cdot}!\ \prod_{s=1}^{S} y_{\cdot s}!}{Y!\ \prod_{r=1}^{R} \prod_{s=1}^{S} y_{rs}!}. \tag{10} $$
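The probability of a table given its margins is a ratio of factorials, which is easiest to evaluate on the log scale. A minimal sketch, with the hypothetical function name `prob_table_given_margins`:

```python
import math


def prob_table_given_margins(table):
    """Probability of the table interior given both margins under independence."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    # lgamma(n + 1) equals log(n!).
    log_p = (sum(math.lgamma(r + 1) for r in row_sums)
             + sum(math.lgamma(c + 1) for c in col_sums)
             - math.lgamma(total + 1)
             - sum(math.lgamma(v + 1) for row in table for v in row))
    return math.exp(log_p)


p = prob_table_given_margins([[1, 1], [1, 1]])
```

For the 2 by 2 table with a single count in every cell, the factorial ratio is 16/24, so the probability is 2/3.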


Joint Multinomial Sampling Models 


Under $H_0$ when the total $Y$ is fixed, the priors are Dirichlet($R$, 1) and Dirichlet($S$, 1) for $\pi_{r\cdot}$ and $\pi_{\cdot s}$,
respectively. Note that Dirichlet($R$, 1) and Dirichlet($S$, 1) are assumed to be independent. Under $H_1$, the
prior is Dirichlet($RS$, 1) for $\pi_{rs}$. The proposed Bayes factor is


$$ BF_{10} = \frac{\Phi(y_{rs}, RS, 1)}{\Phi(y_{r\cdot}, R, 1)\, \Phi(y_{\cdot s}, S, 1)\, \mathrm{Pr}_y} = \frac{\Phi'(y_{rs}, RS, 1)}{\Phi'(y_{r\cdot}, R, 1)\, \Phi'(y_{\cdot s}, S, 1)}, \tag{11} $$


where $\Phi'$ denotes the integral part within Equation (8); $\Phi$ and $\mathrm{Pr}_y$ are defined by Equations (8) and (10),
respectively. We compute $BF_{01} = 1/BF_{10}$ to output the Bayes factor in favor of the null hypothesis.


Independent Multinomial Sampling Models 


Under Hp when the column sums are fixed, the prior is Dirichlet(S,1) for 7,,. The proposed Bayes factor is 


$$ BF_{10} = \frac{\Phi(y_{rs}, RS, 1)}{\Phi(y_{r\cdot}, R, 1)\, \Phi(y_{\cdot s}, S, R)\, \mathrm{Pr}_y} = \frac{\Phi'(y_{rs}, RS, 1)}{\Phi'(y_{r\cdot}, R, 1)\, \Phi'(y_{\cdot s}, S, R)}, \tag{12} $$


where $\Phi'$ denotes the integral part within Equation (8); $\Phi$ and $\mathrm{Pr}_y$ are defined by Equations (8) and (10),
respectively. We compute $BF_{01} = 1/BF_{10}$ to output the Bayes factor in favor of the null hypothesis. By
symmetry, if the row sums are fixed, we can switch the columns and rows in the contingency table, and
apply Equation (12).


Bayes Factors Based on Intrinsic Priors 


[Casella and Moreno, 2009] proposed the Bayes factors based on intrinsic priors and posterior probabilities.
Because of the computational burden, we only implement the methods discussed in this section for 2 × 2 contingency
tables with $R = S = 2$.

In the following presentation, we let z = {z,,} denote the possible design of a contingency table, and 
let the sign Las z,,<y denote the summation over z with all possible designs of the contingency table of 


Se a 

Joint Multinomial Sampling Models


The Bayes factor for independence based on the intrinsic prior under the joint Multinomial sampling models 
is 


(Y+RS—1)! y (es z,.!) (1 zit) 
mgt Scale 


© OY + RS— 1! ey M2) (Tw!) (Ta vs!) 


Y Y Yy! 
()=( a a 
z 7115 712) £21; 722 21! 219! ... Zerg! 


To overcome the computational hurdle, we introduce an additional parameter $t$ to control the training sample
size, and rewrite Equation (13) as


Zrg: 


R Ss i 
if I (23 T eel (13) 


where 


(t+RS—1)! L(V +REY +5) [ast eal) eae 
err aN ae ne ae a 9 nT Se a 


(15) 


where we set $t = 500$ for 2 by 2 contingency tables. If the observed grand total $Y > t$, we compute $BF_{10}(t)$,
and $BF_{10}$ otherwise.

Finally, we compute $BF_{01} = 1/BF_{10}$, or $BF_{01}(t) = 1/BF_{10}(t)$, to output the Bayes factor in favor of the
null hypothesis. If the contingency table under analysis has a dimension exceeding 2 by 2, we
give a warning message, and compute the Bayes factor by using the method presented in the “Bayes Factors
Based on a Mixture of Symmetric Dirichlet Distributions” section.


Independent Multinomial Sampling Models 
Under the null hypothesis, the default marginal distribution is given by 


mo(y ) = ee TT (%) x Im (16) 
ok ry + S) = Ure = 8" 9 
where 
Ur. Ur. 
ss ry 17 
i Yr! Yra! tee yrs! ( ) 


The intrinsic marginal distribution under the independent Multinomial sampling models when the row sums 
are fixed is 


m1(Yax) =T(S) Il De ) Thai Pr +8) S Tse 23! il e ) Th, (ere + Yrs)! | 


r=1 Urx ny a 5) (Za Zaps ZRe)! | ear eee 25! r=1 Zr P(2yp. a S) 
Dos 2rs=Yr. 
(18) 
where 
Ur. Ur. . 
= 19 
Ge Zpr1! Zrq! ... Zrg! ce 
To conquer the computation hurdle, we may consider 
R Ss R Ss 
es Yr. \ Trai P(tr. +S) []s=1 2.s! tr. \ [[s=1 (rs + Yrs)! 
mr(Yuut) =2(S) [J = age a De Te 1, gat tt lene] Ptr tur +5)’ 
Urx (eagsem Ne 2R«): ihe 1 ft 1 Zigl = TK is Ur. 
SreHtr. 


(20) 


where we set $t_{r\cdot} = 5000$, and consider four different conditions as follows for a certain 2 by 2 contingency
table design:

• When $y_{1\cdot} > t_{1\cdot}$ and $y_{2\cdot} > t_{2\cdot}$, use Equation (20) by setting $t = t_{1\cdot} + t_{2\cdot}$;
• When $y_{1\cdot} > t_{1\cdot}$ and $y_{2\cdot} \le t_{2\cdot}$, use Equation (20) by setting $t = t_{1\cdot} + y_{2\cdot}$ and $t_{2\cdot} = y_{2\cdot}$;
• When $y_{1\cdot} \le t_{1\cdot}$ and $y_{2\cdot} > t_{2\cdot}$, use Equation (20) by setting $t = y_{1\cdot} + t_{2\cdot}$ and $t_{1\cdot} = y_{1\cdot}$;
• When $y_{1\cdot} \le t_{1\cdot}$ and $y_{2\cdot} \le t_{2\cdot}$, use Equation (18).

Thus, the desired Bayes factor is 


BF = atte of Bij see (21) 


My Yar ) M7 (You t) 
depending on the setting of t,... 

Note that the results are symmetrical in terms of the columns and the rows. If the column sums are 
fixed, we can switch the rows and columns in the contingency table, and apply Equation (21). In case 
the contingency table under analysis has the dimension exceeding 2 by 2, we will give a warning message, 
and compute the Bayes factor by using the method presented in the “Independent Multinomial Sampling 
Models” section with corresponding fixed row or column margins. 


Bayes Factors Based on Nonparametric Bayesian Models 


[Quintana, 1998] proposed the Bayes factors based on nonparametric models and Dirichlet process priors.
We only implement this method when $R = 2$ (or $S = 2$) with the row sums $y_{r\cdot}$ (or column sums $y_{\cdot s}$)
fixed. Note that the priors are the Dirichlet processes. Under this particular situation, however, the prior
probabilities cancel out, which frees users from specifying prior information on weight.

Let $\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_S)$, where $\lambda_s > 0$, and we define the Dirichlet prior


$$ D(\lambda) = \frac{\Gamma\left( \sum_{s=1}^{S} \lambda_s \right)}{\prod_{s=1}^{S} \Gamma(\lambda_s)}. \tag{22} $$
Thus, the Bayes factor is 
_ Li (Yuu) 
BFo1 — Lo (Yur) ’ (23) 
where 
oe (24) 
MT DAF Yaa + Ya8) 
and 
D(A) D(A) 
Lo(Yux) = : 25 
2(Yax) D(A+ Y1%) 7 D(A+ Yaw) (28) 


Note that $\lambda$ is specified by users. We set $\lambda = 1$ by default. The results are symmetrical in terms of the
columns and the rows. If the column sums are fixed, we can switch the columns and rows in the contingency
table, and apply Equation (23).


Bayesian Inference by Constructing Credible Intervals 


In this section, we consider the model
$$ \pi_{jk} = \lambda \exp\{\alpha_j + \beta_k + \gamma_{jk}\}, \tag{26} $$


where $j = 1, 2, \ldots, R$, $k = 1, 2, \ldots, S$, and $\lambda^{-1} = \sum_{j=1}^{R} \sum_{k=1}^{S} \exp\{\alpha_j + \beta_k + \gamma_{jk}\}$ with the restrictions
$\alpha_R = \beta_S = \gamma_{jS} = \gamma_{Rk} = 0$. Testing the independence of two factors is equivalent to making inference on $\gamma_{jk}$,
where $j = 1, 2, \ldots, R - 1$ and $k = 1, 2, \ldots, S - 1$.

Based on the model, [Nandram and Choi, 2007] proposed to draw a random sample from Dirichlet (1), and
computed the posterior distribution of $\gamma_{jk}$. They finally applied the method discussed by [Besag et al., 1995]
to construct the desired simultaneous credible interval region, which is a hyper-rectangular credible region
for the $(R - 1)(S - 1)$ interaction effects in a two-way contingency table. Inference can be made by checking
whether or not each interval contains 0.


References 


[Besag et al., 1995] Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995). Bayesian computation and
stochastic systems. Statistical Science, pages 3-41.


[Casella and Moreno, 2009] Casella, G. and Moreno, E. (2009). Assessing robustness of intrinsic tests of
independence in two-way contingency tables. Journal of the American Statistical Association, 104(487):1261-1271.


[Good, 1976] Good, I. J. (1976). On the application of symmetric Dirichlet distributions and their mixtures
to contingency tables. The Annals of Statistics, pages 1159-1189.


[Gunel and Dickey, 1974] Gunel, E. and Dickey, J. (1974). Bayes factors for independence in contingency
tables. Biometrika, pages 545-557.


[Nandram and Choi, 2007] Nandram, B. and Choi, J. W. (2007). Alternative tests of independence in
two-way categorical tables. Journal of Data Science, 5(2):217-237.


[Quintana, 1998] Quintana, F. A. (1998). Nonparametric Bayesian analysis for assessing homogeneity in
k × l contingency tables with fixed right margin totals. Journal of the American Statistical Association,
93(443):1140-1149.


BAYES ONESAMPLE Algorithms 




One-Sample Bayesian Inference on Normal Distribution 


Notations 


The following notations defined in this section will be used for the subsequent sections. 


$x_i$: Observed value of variable $X$ for the $i$-th case.


$y_i$: Observed value of variable $Y$ for the $i$-th case.


$w_i$: Frequency weight for the $i$-th case. A non-integer frequency weight is rounded to the nearest integer.
For values less than 0.5 or missing, the corresponding case will not be used.


$N$: Number of cases in the data set.


$W$: Effective sample size $W = \sum_{i=1}^{N} w_i$. $W = N$ if no weights are present.


$\mu_0$: Test value specified by the null hypothesis.


Basic Statistics for One-Sample t-Test 


The Bayes factor for one-sample t-test, proposed by [Rouder et al., 2009], relies on the conventional
t-statistic, the computation of which is discussed in this section. The following statistics are computed.


Sample mean $\bar{x} = \frac{1}{W} \sum_{i=1}^{N} w_i x_i$. (1)
Sample variance $s_x^2 = \frac{1}{W - 1} \sum_{i=1}^{N} w_i (x_i - \bar{x})^2$. (2)
Sample standard deviation $s_x = \sqrt{s_x^2}$. (3)
Standard error of the mean $s_{\bar{x}} = \frac{s_x}{\sqrt{W}}$. (4)
Mean difference $d = \bar{x} - \mu_0$. (5)
Observed t-statistic $t = \frac{d}{s_{\bar{x}}}$, with $(W - 1)$ degrees of freedom. (6)
Significance (2-tailed) Sig. (2-tailed) $= 2\,[1 - \mathrm{CdfT}(|t|, W - 1)]$. (7)
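The basic statistics above can be sketched in a few lines. The function name `one_sample_t` and the dictionary it returns are hypothetical choices for illustration; the two-tailed significance is left out because it requires a t distribution function that the Python standard library does not provide.

```python
import math


def one_sample_t(x, mu0, w=None):
    """Weighted one-sample t statistics for test value mu0."""
    w = [1] * len(x) if w is None else w
    W = sum(w)                                            # effective sample size
    mean = sum(wi * xi for wi, xi in zip(w, x)) / W
    var = sum(wi * (xi - mean) ** 2 for wi, xi in zip(w, x)) / (W - 1)
    sd = math.sqrt(var)
    se = sd / math.sqrt(W)                                # standard error of the mean
    d = mean - mu0                                        # mean difference
    return {"mean": mean, "variance": var, "sd": sd, "se": se,
            "difference": d, "t": d / se, "df": W - 1}


stats = one_sample_t([1, 2, 3, 4, 5], mu0=2)
```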


Basic Statistics for Two-Sample Paired t-Test 


For the Bayes factor two-sample paired t-test, the following statistics are computed.


Sample mean $\bar{x} = \frac{1}{W} \sum_{i=1}^{N} w_i x_i$. (8)
Sample mean $\bar{y} = \frac{1}{W} \sum_{i=1}^{N} w_i y_i$. (9)
Difference of the sample means $d = \bar{x} - \bar{y}$. (10)
Sample variance $s_x^2 = \frac{1}{W - 1} \left[ \sum_{i=1}^{N} w_i x_i^2 - \left( \sum_{i=1}^{N} w_i x_i \right)^2 / W \right]$. (11)
Sample variance $s_y^2 = \frac{1}{W - 1} \left[ \sum_{i=1}^{N} w_i y_i^2 - \left( \sum_{i=1}^{N} w_i y_i \right)^2 / W \right]$. (12)
Covariance between $X$ and $Y$ $s_{xy} = \frac{1}{W - 1} \left[ \sum_{i=1}^{N} w_i x_i y_i - \left( \sum_{i=1}^{N} w_i x_i \right) \left( \sum_{i=1}^{N} w_i y_i \right) / W \right]$. (13)
Standard deviation of the mean difference $s_D = \sqrt{s_x^2 + s_y^2 - 2 s_{xy}}$. (14)
Standard error of the mean difference $s_d = \sqrt{(s_x^2 + s_y^2 - 2 s_{xy}) / W}$. (15)
Observed t-statistic for equality of means $t = \frac{d}{s_d}$, with $(W - 1)$ degrees of freedom. (16)
Significance (2-tailed) Sig. (2-tailed) $= 2\,[1 - \mathrm{CdfT}(|t|, W - 1)]$. (17)


Bayes Factor for One-Sample and Two-Sample Paired t-Test with Known Variance 


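We can use the sufficient statistic $\bar{X}$ to formulate the Bayes factor under this setting


$$ B_{01} = \frac{\Pr(\bar{x} \mid \mu = \mu_0)}{\Pr(\bar{x} \mid \mu \ne \mu_0)} = \frac{(2\pi \sigma_x^2 / W)^{-1/2} \exp\left[ -\frac{1}{2} (\bar{x} - \mu_0)^2 / (\sigma_x^2 / W) \right]}{\left( 2\pi (\psi^2 + \sigma_x^2 / W) \right)^{-1/2} \exp\left[ -\frac{1}{2} (\bar{x} - \mu_0)^2 / (\psi^2 + \sigma_x^2 / W) \right]} $$

$$ = \sqrt{1 + W g}\ \exp\left[ -\frac{1}{2} (\bar{x} - \mu_0)^2 (\sigma_x^2)^{-1} \left( \frac{1}{W} + \frac{1}{W^2 g} \right)^{-1} \right], \tag{18} $$


where $\psi^2 = g \sigma_x^2$, and $\mu_0$, $\sigma_x^2 > 0$ and $g > 0$ are specified a priori by users.
For two-sample paired t-test, we can replace $\bar{x}$ with the mean difference $d$ (see Equation (10)), and $\sigma_x^2$ with $\sigma_d^2$
(specified by users), respectively, in Equation (18) to estimate the desired Bayes factor.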
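As a sanity check on the closed form, the sketch below computes the known-variance Bayes factor in two ways: from the simplified expression and as a direct ratio of two normal densities. It assumes that the prior variance under the alternative equals g times the known variance; both function names are hypothetical.

```python
import math
from statistics import NormalDist


def bf01_known_variance(xbar, mu0, sigma2, W, g):
    """Closed-form Bayes factor for the null with known variance sigma2."""
    scale = 1.0 / (1.0 / W + 1.0 / (W * W * g))
    return math.sqrt(1.0 + W * g) * math.exp(-0.5 * (xbar - mu0) ** 2 * scale / sigma2)


def bf01_density_ratio(xbar, mu0, sigma2, W, g):
    """Same Bayes factor as a ratio of marginal densities of the sample mean."""
    null = NormalDist(mu0, math.sqrt(sigma2 / W))
    alternative = NormalDist(mu0, math.sqrt(g * sigma2 + sigma2 / W))
    return null.pdf(xbar) / alternative.pdf(xbar)


bf = bf01_known_variance(xbar=0.3, mu0=0.0, sigma2=1.0, W=25, g=1.0)
```

When the sample mean equals the test value the exponential term vanishes, so the Bayes factor reduces to the square root of 1 + Wg.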


Bayes Factor for One-Sample and Two-Sample Paired t-Test with Unknown Variance 


Suppose $X_i \sim \text{Normal}(\mu, \sigma^2)$, $i = 1, 2, \ldots, N$, where $\sigma^2$ is unknown, and we are interested in testing the null
hypothesis $H_0: \mu = 0$ versus the alternative hypothesis $H_1: \mu \ne 0$. We assume that $\mu \sim \text{Normal}(\mu_0, \psi^2)$
and $p(\sigma^2) = 1/\sigma^2$. In addition, we further specify the relationship between $\psi^2$ and $\sigma^2$ by letting $\psi^2 = g\sigma^2$,
where $g > 0$. Thus, the Bayes factor under this setting is


$$ B_{01} = \frac{\left( 1 + \dfrac{t^2}{\nu} \right)^{-(\nu + 1)/2}}{(1 + W g)^{-1/2} \left( 1 + \dfrac{t^2}{(1 + W g)\, \nu} \right)^{-(\nu + 1)/2}}, \tag{19} $$

where $t$ is defined by Equation (6); $\nu = W - 1$; and $g > 0$ is set a priori by users.
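The fixed-g Bayes factor depends on the data only through the observed t-statistic and the effective sample size, so it is straightforward to evaluate. A minimal sketch on the log scale follows; the function name `bf01_fixed_g` is hypothetical.

```python
import math


def bf01_fixed_g(t, W, g):
    """Bayes factor for the null given t-statistic t, sample size W and fixed g."""
    nu = W - 1
    log_num = -(nu + 1) / 2.0 * math.log1p(t * t / nu)
    log_den = (-0.5 * math.log1p(W * g)
               - (nu + 1) / 2.0 * math.log1p(t * t / ((1.0 + W * g) * nu)))
    return math.exp(log_num - log_den)


bf_small_t = bf01_fixed_g(t=0.0, W=20, g=1.0)
bf_large_t = bf01_fixed_g(t=5.0, W=20, g=1.0)
```

With a t-statistic of zero the data favor the null and the Bayes factor equals the square root of 1 + Wg; a large t-statistic pushes it below 1.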
Note that Rouder et al. proposed a more general approach, which we consider here [Rouder et al., 2009].

To construct the Bayes factor, we have to choose and place priors on both $\mu$ and $\sigma^2$. Let $\delta = \mu/\sigma$ denote
the standardized effect size. It is then equivalent to test $H_0: \delta = 0$. One way to set the alternative hypothesis
is to assume that $H_1: \delta \sim \text{Normal}(0, \sigma_\delta^2)$, where $\sigma_\delta^2$ is specified a priori [Gonen et al., 2005]. A couple
of reasonable settings of $\sigma_\delta^2$ may include $\sigma_\delta^2 = 1$ or $\sigma_\delta^2 \sim \text{Inverse-}\chi^2(1)$. In this document, we assume that
$\delta \sim \text{Cauchy}$, which is a t distribution with a single degree of freedom. For variance $\sigma^2$, we apply a standard
setting of the Jeffreys prior with $p(\sigma^2) = 1/\sigma^2$ [Jeffreys, 1998]. Such a combination of the Cauchy on effect
size $\delta$ and the Jeffreys prior on variance $\sigma^2$ is coined the JZS prior in [Rouder et al., 2009].


Thus, the Bayes factor for one-sample t-test with the JZS prior is


$$ B_{01} = \frac{\left( 1 + \dfrac{t^2}{\nu} \right)^{-(\nu + 1)/2}}{\displaystyle\int_0^\infty (1 + W g)^{-1/2} \left( 1 + \frac{t^2}{(1 + W g)\, \nu} \right)^{-(\nu + 1)/2} (2\pi)^{-1/2}\, g^{-3/2}\, e^{-1/(2g)}\, dg}, \tag{20} $$


where $t$ is defined by Equation (6); $\nu = W - 1$; and $g$ is the variable to be integrated out.
For two-sample paired t-test, both Equations (19) and (20) apply with the substitution of $t$ computed by
Equation (16).

Bayesian One-Sample Inference on Mean By Characterizing Posterior Distributions

Bayesian One-Sample Inference on Mean Using Conjugate and Noninformative Priors 
Notations 

The following notations defined in this section will be used for the subsequent sections. 


$X$: A random variable to be tested whose values are observed. We assume $X \sim \text{Normal}(\mu_x, \sigma_x^2)$, where
both $\mu_x$ and $\sigma_x^2$ are unknown.


$\mu_x$: Mean parameter of $X$, with its prior distribution assumed in later discussions.
$\sigma_x^2$: Variance parameter of $X$, with its prior distribution assumed in later discussions if unknown.


$w_i$: Frequency weight for the $i$-th case. A non-integer frequency weight is rounded to the nearest integer.
For values less than 0.5 or missing, the corresponding case will not be used.


$N$: Number of cases in the data set.


$W$: Effective sample size $W = \sum_{i=1}^{N} w_i$. $W = N$ if no weights are present.


Normal Prior with Known Variance 


In this section, we assume that the variance parameter $\sigma_x^2$ is known. Although this situation is not common
in practice, we consider it a nice example from a teaching perspective.

We place a normal prior on $\mu_x$ by assuming that $\mu_x \sim \text{Normal}(\mu_0, \sigma_0^2)$, where $\mu_0$ and $\sigma_0^2$ are specified by
users. Under this setting, the marginal posterior distribution of $\mu_x$ is


• $\mu_x \mid (X, \sigma_x^2) \sim \text{Normal}(\mu_n, \sigma_n^2)$,

where $\sigma_n^2 = \left( \dfrac{1}{\sigma_0^2} + \dfrac{W}{\sigma_x^2} \right)^{-1}$, and $\mu_n = \sigma_n^2 \left( \dfrac{\mu_0}{\sigma_0^2} + \dfrac{W \bar{x}}{\sigma_x^2} \right)$. We may find the Bayes estimators of $\mu_x$ by
computing the mode


$$ \hat{\mu}_x = \mu_n, \tag{21} $$


the expected value


$$ E(\mu_x \mid X) = \mu_n, \tag{22} $$


and the variance of the marginal posterior distribution of $\mu_x \mid X$
$$ V(\mu_x \mid X) = \sigma_n^2. \tag{23} $$


We may also find a 100(1 − c)% equal-tailed Bayesian credible interval covering $\mu_x$ such that


$$ \mu_x \in \left( \mathrm{IdfNorm}\!\left(\tfrac{c}{2}, \mu_n, \sqrt{\sigma_n^2}\right),\ \mathrm{IdfNorm}\!\left(1 - \tfrac{c}{2}, \mu_n, \sqrt{\sigma_n^2}\right) \right) \tag{24} $$


with probability 1 − c, where c = 0.05 by default.
For two-sample paired t-test, we can repeat the procedure by replacing $\sigma_x^2$ with $\sigma_d^2$, and placing the prior
on the mean difference.
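The conjugate update with known variance can be sketched using only the standard library, since the credible interval needs just the inverse normal distribution function. The function name `normal_known_variance_posterior` and its argument names are hypothetical.

```python
import math
from statistics import NormalDist


def normal_known_variance_posterior(xbar, W, sigma2_x, mu0, sigma2_0, c=0.05):
    """Posterior mean, variance and equal-tailed interval for the mean.

    xbar: sample mean; W: effective sample size; sigma2_x: known variance;
    mu0, sigma2_0: prior mean and variance; c: one minus the coverage.
    """
    sigma2_n = 1.0 / (1.0 / sigma2_0 + W / sigma2_x)       # posterior variance
    mu_n = sigma2_n * (mu0 / sigma2_0 + W * xbar / sigma2_x)  # posterior mean
    posterior = NormalDist(mu_n, math.sqrt(sigma2_n))
    interval = (posterior.inv_cdf(c / 2.0), posterior.inv_cdf(1.0 - c / 2.0))
    return mu_n, sigma2_n, interval


mu_n, sigma2_n, (low, high) = normal_known_variance_posterior(
    xbar=1.0, W=4, sigma2_x=1.0, mu0=0.0, sigma2_0=1.0)
```

The posterior mean is a precision-weighted average of the prior mean and the sample mean, so with four observations and a unit-variance prior it sits at 0.8, shrunk from the sample mean of 1 toward the prior mean of 0.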


Diffuse Prior with Known Variance 


In this section, we assume that the variance parameter $\sigma_x^2$ is known, and place a flat prior on $\mu_x$ by assuming
that $p(\mu_x) \propto 1$. Under this setting, the marginal posterior distribution of $\mu_x$ is


• $\mu_x \mid (X, \sigma_x^2) \sim \text{Normal}(\bar{x}, \sigma_x^2 / W)$.
We may find the Bayes estimators of $\mu_x$ by computing the mode
$$ \hat{\mu}_x = \bar{x}, \tag{25} $$


the expected value


$$ E(\mu_x \mid X) = \bar{x}, \tag{26} $$


and the variance of the marginal posterior distribution of $\mu_x \mid X$


$$ V(\mu_x \mid X) = \sigma_x^2 / W. \tag{27} $$


We may also find a 100(1 − c)% equal-tailed Bayesian credible interval covering $\mu_x$ such that


$$ \mu_x \in \left( \mathrm{IdfNorm}\!\left(\tfrac{c}{2}, \bar{x}, \sqrt{\sigma_x^2 / W}\right),\ \mathrm{IdfNorm}\!\left(1 - \tfrac{c}{2}, \bar{x}, \sqrt{\sigma_x^2 / W}\right) \right) \tag{28} $$


with probability 1 − c, where c = 0.05 by default.
For two-sample paired t-test, we can repeat the procedure by replacing $\sigma_x^2$ with $\sigma_d^2$, and placing the prior
on the mean difference.




Normal-Inverse Chi-Square Priors 
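In this section, we assume and place the following priors


• $\sigma_x^2 \sim \text{Inverse-}\chi^2(\nu_0, \sigma_0^2)$


• $\mu_x \mid \sigma_x^2 \sim \text{Normal}\left( \mu_0, \dfrac{\sigma_x^2}{\kappa_0} \right)$,


where $\sigma_x^2$ is conditioned on, and scaled by $\kappa_0$ ($\kappa_0 > 0$, and $\kappa_0 = 1$ by default). Note that $\nu_0$, $\sigma_0^2$, $\mu_0$, and $\kappa_0$
are specified by users. Under this setting, the marginal posterior distributions are


• $\sigma_x^2 \mid X \sim \text{Inverse-}\chi^2(\nu_n, \sigma_n^2)$


• $\mu_x \mid X \sim t_{\nu_n}\left( \mu_n, \dfrac{\sigma_n^2}{\kappa_n} \right)$,


where $\nu_n = \nu_0 + W$, $\kappa_n = \kappa_0 + W$, $\mu_n = \dfrac{\kappa_0}{\kappa_n} \mu_0 + \dfrac{W}{\kappa_n} \bar{x}$, and $\sigma_n^2 = \dfrac{1}{\nu_n} \left( \nu_0 \sigma_0^2 + \sum_{i=1}^{N} w_i (x_i - \bar{x})^2 + \dfrac{W \kappa_0}{\kappa_n} (\mu_0 - \bar{x})^2 \right)$.
We may find the Bayes estimators of $\mu_x$ by computing the mode


$$ \hat{\mu}_x = \mu_n, \tag{29} $$
the expected value


$$ E(\mu_x \mid X) = \mu_n, \tag{30} $$


and the variance of the marginal posterior distribution of $\mu_x \mid X$


$$ V(\mu_x \mid X) = \frac{\nu_n}{\nu_n - 2}\, \frac{\sigma_n^2}{\kappa_n}. \tag{31} $$


We may also find a 100(1 − c)% equal-tailed Bayesian credible interval covering $\mu_x$ such that


$$ \mu_x \in \left( \mu_n - \mathrm{IdfT}\!\left(1 - \tfrac{c}{2}, \nu_n\right) \sqrt{\frac{\sigma_n^2}{\kappa_n}},\ \ \mu_n + \mathrm{IdfT}\!\left(1 - \tfrac{c}{2}, \nu_n\right) \sqrt{\frac{\sigma_n^2}{\kappa_n}} \right) \tag{32} $$


with probability 1 − c, where c = 0.05 by default.
For two-sample paired t-test, we can repeat the procedure by placing the priors on the mean and the
variance of the difference between the two paired variables.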
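The Normal-Inverse Chi-Square update can be written as a short function. This is a sketch under the standard conjugate formulas for this prior, not the procedure's implementation; the function name `nix_posterior` and its argument names are hypothetical.

```python
def nix_posterior(x, mu0, kappa0, nu0, sigma2_0, w=None):
    """Posterior parameters for the Normal-Inverse Chi-Square prior.

    Returns (nu_n, kappa_n, mu_n, sigma2_n, variance_of_mean).
    """
    w = [1] * len(x) if w is None else w
    W = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / W
    ss = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    nu_n = nu0 + W
    kappa_n = kappa0 + W
    mu_n = (kappa0 * mu0 + W * xbar) / kappa_n
    sigma2_n = (nu0 * sigma2_0 + ss
                + W * kappa0 / kappa_n * (mu0 - xbar) ** 2) / nu_n
    # Variance of the marginal t posterior of the mean (needs nu_n > 2).
    variance = nu_n / (nu_n - 2) * sigma2_n / kappa_n if nu_n > 2 else float("nan")
    return nu_n, kappa_n, mu_n, sigma2_n, variance


result = nix_posterior([1, 2, 3], mu0=0.0, kappa0=1.0, nu0=1.0, sigma2_0=1.0)
```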


Normal-Inverse Gamma Priors 
In this section, we assume and place the following priors


• $\sigma_x^2 \sim \text{Inverse-Gamma}(\alpha_0, \beta_0)$


• $\mu_x \mid \sigma_x^2 \sim \text{Normal}\left( \mu_0, \dfrac{\sigma_x^2}{\kappa_0} \right)$,

where $\sigma_x^2$ is conditioned on, and scaled by $\kappa_0$ ($\kappa_0 > 0$, and $\kappa_0 = 1$ by default). Note that $\alpha_0$, $\beta_0$, $\mu_0$,
and $\kappa_0$ are specified by users. Under this setting, we can simply set $\nu_0 = 2\alpha_0$ and $\sigma_0^2 = 2\beta_0/\nu_0$. Thus,
$\sigma_x^2 \sim \text{Inverse-}\chi^2(2\alpha_0, 2\beta_0/\nu_0)$. The same approach in the “Normal-Inverse Chi-Square Priors” section can
be repeated to compute the posterior distributions.

For two-sample paired t-test, we can repeat the procedure by placing the priors on the mean and the
variance of the difference between the two paired variables.


Normal-Gamma Priors 


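In this section, we reparameterize $\sigma_x^2$ by letting $\tau_x = 1/\sigma_x^2$, which denotes the precision parameter. We
assume and place the following priors.


• $\tau_x \sim \text{Gamma}(\alpha_0, \beta_0)$


• $\mu_x \mid \tau_x \sim \text{Normal}\left( \mu_0, \dfrac{1}{\kappa_0 \tau_x} \right)$,


where $\tau_x$ is conditioned on, and scaled by $\kappa_0$ ($\kappa_0 > 0$, and $\kappa_0 = 1$ by default). Note that $\alpha_0$, $\beta_0$, $\mu_0$, and $\kappa_0$ are
specified by users. Under this setting, the marginal posterior distributions are


• $\tau_x \mid X \sim \text{Gamma}(\alpha_n, \beta_n)$


• $\mu_x \mid X \sim t_{2\alpha_n}\left( \mu_n, \dfrac{\beta_n}{\alpha_n \kappa_n} \right)$,


where $\alpha_n = \alpha_0 + \dfrac{W}{2}$, $\beta_n = \beta_0 + \dfrac{1}{2} \sum_{i=1}^{N} w_i (x_i - \bar{x})^2 + \dfrac{\kappa_0 W (\bar{x} - \mu_0)^2}{2(\kappa_0 + W)}$, $\mu_n = \mu_0 \dfrac{\kappa_0}{\kappa_n} + \bar{x} \dfrac{W}{\kappa_n}$, and $\kappa_n = \kappa_0 + W$.
We may find the Bayes estimators of $\mu_x$ by computing the mode
$$ \hat{\mu}_x = \mu_n, \tag{33} $$


the expected value


$$ E(\mu_x \mid X) = \mu_n, \tag{34} $$


and the variance of the marginal posterior distribution of $\mu_x \mid X$


$$ V(\mu_x \mid X) = \frac{\beta_n}{(\alpha_n - 1)\, \kappa_n}. \tag{35} $$
We may also find a 100(1 − c)% equal-tailed Bayesian credible interval covering $\mu_x$ such that
$$ \mu_x \in \left( \mu_n - \mathrm{IdfT}\!\left(1 - \tfrac{c}{2}, \nu_n\right) \sqrt{\frac{\beta_n}{\alpha_n \kappa_n}},\ \ \mu_n + \mathrm{IdfT}\!\left(1 - \tfrac{c}{2}, \nu_n\right) \sqrt{\frac{\beta_n}{\alpha_n \kappa_n}} \right) \tag{36} $$


with probability 1 − c, where c = 0.05 by default, and $\nu_n = 2\alpha_n$.
For two-sample paired t-test, we can repeat the procedure by placing the priors on the mean and the
variance of the difference between the two paired variables.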


Normal-Chi-Square Priors 


In this section, we reparameterize $\sigma_x^2$ by letting $\tau_x = 1/\sigma_x^2$, which denotes the precision parameter. We
assume and place the following priors.


• $\tau_x \sim \chi^2(\lambda)$
• $\mu_x \mid \tau_x \sim \text{Normal}\left( \mu_0, \dfrac{1}{\kappa_0 \tau_x} \right)$,


where $\tau_x$ is conditioned on, and scaled by $\kappa_0$ ($\kappa_0 > 0$, and $\kappa_0 = 1$ by default). Note that $\lambda$, $\mu_0$, and $\kappa_0$ are
specified by users. Under this setting, we can simply set $\alpha_0 = \lambda/2$ and $\beta_0 = 1/2$. Thus, $\tau_x \sim \text{Gamma}(\lambda/2, 1/2)$.
The same approach in the “Normal-Gamma Priors” section can be repeated to compute the posterior
distributions.

For two-sample paired t-test, we can repeat the procedure by placing the priors on the mean and the
variance of the difference between the two paired variables.


Jeffreys Priors 


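In this section, we assume and place the Jeffreys priors


• $p(\sigma_x^2) \propto \dfrac{1}{\sigma_x^4}$ [Yang and Berger, 1996] or $p(\sigma_x^2) \propto \dfrac{1}{\sigma_x^2}$ [Kass and Wasserman, 1996]
• $p(\mu_x \mid \sigma_x^2) \propto 1$,


where there are two optional priors on $\sigma_x^2$, and $\mu_x$ has a flat prior. Under this setting, the marginal posterior
distributions are


• $\sigma_x^2 \mid X \sim \text{Inverse-Gamma}(\alpha_n, \beta_n)$


• $\mu_x \mid X \sim t_{\nu_n}(\bar{x}, \sigma_n^2)$,


where
for $p(\sigma_x^2) \propto \dfrac{1}{\sigma_x^4}$: $\alpha_n = \dfrac{W + 1}{2}$, $\beta_n = \dfrac{2}{\sum_{i=1}^{N} w_i (x_i - \bar{x})^2}$, $\nu_n = W + 1$, and $\sigma_n^2 = \dfrac{1}{W(W + 1)} \sum_{i=1}^{N} w_i (x_i - \bar{x})^2$,
and
for $p(\sigma_x^2) \propto \dfrac{1}{\sigma_x^2}$: $\alpha_n = \dfrac{W - 1}{2}$, $\beta_n = \dfrac{2}{\sum_{i=1}^{N} w_i (x_i - \bar{x})^2}$, $\nu_n = W - 1$, and $\sigma_n^2 = \dfrac{1}{W(W - 1)} \sum_{i=1}^{N} w_i (x_i - \bar{x})^2$.


We may find the Bayes estimators of $\mu_x$ by computing the mode
$$ \hat{\mu}_x = \bar{x}, \tag{37} $$


the expected value


$$ E(\mu_x \mid X) = \bar{x}, \tag{38} $$
and the variance of the marginal posterior distribution of $\mu_x \mid X$


$$ V(\mu_x \mid X) = \frac{\nu_n}{\nu_n - 2}\, \sigma_n^2. \tag{39} $$


We may also find a 100(1 − c)% equal-tailed Bayesian credible interval covering $\mu_x$ such that


$$ \mu_x \in \left( \bar{x} - \mathrm{IdfT}\!\left(1 - \tfrac{c}{2}, \nu_n\right) \sqrt{\sigma_n^2},\ \ \bar{x} + \mathrm{IdfT}\!\left(1 - \tfrac{c}{2}, \nu_n\right) \sqrt{\sigma_n^2} \right) \tag{40} $$


with probability 1 − c, where c = 0.05 by default.


For two-sample paired t-test, we can repeat the procedure by placing the priors on the mean and the
variance of the difference between the two paired variables.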


Diffuse Priors 


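In this section, we assume and place the diffuse priors

• $p(\sigma_x^2) \propto 1$

• $p(\mu_x \mid \sigma_x^2) \propto 1$, where both $\mu_x$ and $\sigma_x^2$ have a flat prior.
Under this setting, the marginal posterior distributions are


• $\sigma_x^2 \mid X \sim \text{Inverse-Gamma}(\alpha_n, \beta_n)$

• $\mu_x \mid X \sim t_{\nu_n}(\bar{x}, \sigma_n^2)$,


where $\alpha_n = \dfrac{W - 3}{2}$, $\beta_n = \dfrac{2}{\sum_{i=1}^{N} w_i (x_i - \bar{x})^2}$, $\nu_n = W - 3$, and $\sigma_n^2 = \dfrac{1}{W(W - 3)} \sum_{i=1}^{N} w_i (x_i - \bar{x})^2$.
We may find the Bayes estimators of $\mu_x$ by computing the mode
$$ \hat{\mu}_x = \bar{x}, \tag{41} $$
the expected value
$$ E(\mu_x \mid X) = \bar{x}, \tag{42} $$


and the variance of the marginal posterior distribution of $\mu_x \mid X$


$$ V(\mu_x \mid X) = \frac{\nu_n}{\nu_n - 2}\, \sigma_n^2. \tag{43} $$


We may also find a 100(1 − c)% equal-tailed Bayesian credible interval covering $\mu_x$ such that
$$ \mu_x \in \left( \bar{x} - \mathrm{IdfT}\!\left(1 - \tfrac{c}{2}, \nu_n\right) \sqrt{\sigma_n^2},\ \ \bar{x} + \mathrm{IdfT}\!\left(1 - \tfrac{c}{2}, \nu_n\right) \sqrt{\sigma_n^2} \right) \tag{44} $$
with probability 1 − c, where c = 0.05 by default.


For two-sample paired t-test, we can repeat the procedure by placing the priors on the mean and the
variance of the difference between the two paired variables.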

One-Sample Bayesian Inference on Binomial Distribution


Using Bayes-Factor 


Notations 


The following notations defined in this section will be used for the subsequent sections.


$X$: $X = (X_1, X_2, \ldots, X_N)$, a realization of Bernoulli trials with $p(X_i = 1) = \pi$ and $p(X_i = 0) = 1 - \pi$.
$X$ is observed by $x = (x_1, x_2, \ldots, x_N)$, where $x_i$ is either 0 or 1. Note that we can handle any
categorical variables with two different levels, either numeric (4 and 5) or string (Yes and No), which
will be recoded to 0 or 1.


$N$: A total fixed number of cases (trials) in the data set.


$f$: $f = (f_1, f_2, \ldots, f_N)$, a frequency or replication weight for $X$. Non-integer frequency weights are
rounded to the nearest integer. For values less than 0.5 or missing, the corresponding case will not
be used.


$N_f$: $N_f = \sum_{i=1}^{N} f_i$. If there are no frequency weights present, $N_f = N$.
$Y$: $Y = \sum_{i=1}^{N} f_i X_i \sim \text{Binomial}(N_f, \pi)$, where $Y$ is observed by $y$.
$\pi_0$: A population proportion parameter under the null hypothesis $H_0$. We assume that $\pi_0 \sim \text{Beta}(a_0, b_0)$.


$\pi_1$: A population proportion parameter under the alternative hypothesis $H_1$. We assume that $\pi_1 \sim \text{Beta}(a_1, b_1)$.


Bayes-Factor Based on Beta-Binomial Distribution


The Bayes factor based on the Beta-Binomial distribution is


$$ BF_{01} = \frac{\Pr(Y \mid H_0)}{\Pr(Y \mid H_1)} = \frac{B(a_0 + y,\ b_0 + N_f - y)\ B(a_1, b_1)}{B(a_1 + y,\ b_1 + N_f - y)\ B(a_0, b_0)}, \tag{1} $$


where $B$ is the beta function defined by

$$ B(a, b) = \frac{\Gamma(a)\, \Gamma(b)}{\Gamma(a + b)}. \tag{2} $$
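The Beta-Binomial Bayes factor is a ratio of beta functions, which can be evaluated stably through the log-gamma function. A minimal sketch, assuming the usual Gamma-function definition of the beta function; the names `log_beta` and `bf01_beta_binomial` are hypothetical.

```python
import math


def log_beta(a, b):
    """Log of the beta function via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def bf01_beta_binomial(y, n, a0, b0, a1, b1):
    """Bayes factor of H0 (Beta(a0, b0) prior) against H1 (Beta(a1, b1) prior).

    y is the observed number of successes in n weighted trials.
    """
    log_bf = (log_beta(a0 + y, b0 + n - y) + log_beta(a1, b1)
              - log_beta(a1 + y, b1 + n - y) - log_beta(a0, b0))
    return math.exp(log_bf)


bf = bf01_beta_binomial(y=5, n=10, a0=50, b0=50, a1=1, b1=1)
```

In the usage line the null prior is concentrated near 0.5 and the data show exactly half successes, so the Bayes factor favors the null (it is greater than 1). Identical priors under both hypotheses always give a Bayes factor of 1.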


Using Conjugate and Noninformative Priors 


Notations 


The following notations defined in this section will be used for the subsequent sections.


$X$: $X = (X_1, X_2, \ldots, X_N)$, a realization of Bernoulli trials with $p(X_i = 1) = \pi$ and $p(X_i = 0) = 1 - \pi$.
$X$ is observed by $x = (x_1, x_2, \ldots, x_N)$, where $x_i$ is either 0 or 1. Note that we can handle any
categorical variables with two different levels, either numeric (4 and 5) or string (Yes and No), which
will be recoded to 0 or 1.


$N$: A total fixed number of cases (trials) in the data set.


$f$: $f = (f_1, f_2, \ldots, f_N)$, a frequency or replication weight for $X$. Non-integer frequency weights are
rounded to the nearest integer. For values less than 0.5 or missing, the corresponding case will not
be used.


$N_f$: $N_f = \sum_{i=1}^{N} f_i$. If there are no weights present, $N_f = N$.
$Y$: $Y = \sum_{i=1}^{N} f_i X_i \sim \text{Binomial}(N_f, \pi)$, where $Y$ is observed by $y$.


$\pi$: A population proportion parameter, with its prior distribution assumed in later discussions.
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Beta Prior 


In this section, we place a conjugate prior placed on 7 by assuming that 7 ~ Beta(a, b), where a,b > 0. The 


sufficient sample size to estimate is Ny > 1. 
Under this setting, the marginal posterior distribution of 7 is 


e 7|Y ~ Beta(a+y,b+ Ny — y). 
Note that this also applies to the following two special cases: 
e Uniform prior, when a = b= 1, 
e Jeffreys prior, when a = b = 0.5. 
We may find the Bayes estimators of 7 by computing the expected value 


pesca ae 
a+b+ Ny’ 


(mY) = 


and the variance of the marginal posterior distribution of 7|Y 


(a+ y)(O+ Ny —y) 
(a+b+ Ny)?(a+b4+ Nye 41) © 


V(r|/Y) = 


We may also find a 100(1 — c)% Bayesian credible interval with equal tail covering 7 such that 


Cc 


TE (1arBeta(s 


,a+y,b+N;—y), IdfBeta(1— 5,a+y,6+N;—y)) 


with the probability of c, where c = 0.05 by default. 
Finding the mode of $\pi \mid Y$ requires a closer look at the parameter support. 


• If $a + y > 1$ and $b + N_f - y > 1$, 

$$\tilde{\pi} = \frac{a + y - 1}{a + b + N_f - 2}. \qquad (6)$$


• If $a + y < 1$ and $b + N_f - y < 1$, the left mode is 0, the right mode is 1, and we define the anti-mode 

$$\tilde{\pi} = \frac{a + y - 1}{a + b + N_f - 2}. \qquad (7)$$

In the output design, we may indicate that this is the “anti-mode”. 


• If $a + y < 1$ and $b + N_f - y \ge 1$, or if $a + y = 1$ and $b + N_f - y > 1$, $\tilde{\pi} = 0$. 


• If $a + y \ge 1$ and $b + N_f - y < 1$, or if $a + y > 1$ and $b + N_f - y = 1$, $\tilde{\pi} = 1$. 


• Note that if $a + y = b + N_f - y = 1$, the posterior distribution is actually a uniform distribution with 
the mode equal to any value in the range [0, 1]. 
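
The posterior summaries above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `beta_posterior_summary` and its return layout are hypothetical, and the credible interval is omitted because the inverse Beta distribution function is not part of the standard library.

```python
def beta_posterior_summary(a, b, y, n_f):
    """Mean, variance and mode of pi | Y ~ Beta(a + y, b + n_f - y).

    Hypothetical helper for illustration; a = b = 0 corresponds to
    Haldane's prior as long as the posterior parameters stay usable.
    """
    a_post = a + y
    b_post = b + n_f - y
    total = a_post + b_post
    mean = a_post / total
    variance = a_post * b_post / (total ** 2 * (total + 1))

    if a_post > 1 and b_post > 1:
        kind, mode = "mode", (a_post - 1) / (total - 2)
    elif a_post < 1 and b_post < 1:
        # Density is U-shaped: modes at 0 and 1, interior point is the anti-mode.
        kind, mode = "anti-mode", (a_post - 1) / (total - 2)
    elif a_post == 1 and b_post == 1:
        # Uniform posterior: every value in [0, 1] is a mode.
        kind, mode = "uniform", None
    elif a_post < 1 or (a_post == 1 and b_post > 1):
        kind, mode = "mode", 0.0
    else:
        kind, mode = "mode", 1.0
    return {"mean": mean, "variance": variance, "mode": mode, "kind": kind}


if __name__ == "__main__":
    # Uniform prior (a = b = 1), 3 successes out of 10 weighted trials.
    print(beta_posterior_summary(1, 1, 3, 10))
```

With a uniform prior, 3 successes and 10 trials, the posterior is Beta(4, 8), so the mean is 1/3 and the mode is 0.3.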


Haldane’s Prior 


The density function of Haldane’s prior is $p(\pi) \propto \pi^{-1}(1 - \pi)^{-1}$, which is an improper prior distribution. 
It can be treated as a special Beta distribution with $a = b = 0$. The preceding statistics derived from 
the posterior distribution still apply. Hence, we can allow a conjugate prior on $\pi$ by assuming that 
$\pi \sim \mathrm{Beta}(a, b)$, where $a, b > 0$, together with the special case $a = b = 0$ to handle Haldane’s prior. 


Define “Success” for Variables 


To include scale variables and categorical variables with more than two levels, we discuss several 
ways to define the “success” category and to recode and dichotomize the variables. 

Numerical Variables 


To dichotomize a numerical variable with two or more values, we offer the following options to 
define “success”: 


• Using the last value found in the category after sorting in ascending order, which is the default 
setting. 

• Using the first value found in the category after sorting in ascending order. 

• Using the values > the midpoint, which is the average of the minimum and maximum sample data. 

• Using the values > a specified cutoff value. 

• Using the specified values (can be more than 1) in the sample data. 


String Variables 

To recode a string variable with more than two levels, we offer the following options to define “success”: 


• Using the last level found in the category after sorting in ascending order, which is the default 
setting. 

• Using the first level found in the category after sorting in ascending order. 

• Using the specified levels (can be more than 1) in the sample data. 
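
As a rough illustration of the options listed above, the following Python sketch recodes a variable to 0/1. The function name `define_success` and the rule labels are hypothetical, and the strict comparison (`>`) for the midpoint and cutoff rules simply follows the wording of the text; whether the boundary value itself counts as a success is an assumption.

```python
def define_success(values, rule="last", cutoff=None, levels=None):
    """Recode a numeric or string variable to 0/1 (1 = "success").

    rule: "last" (default), "first", "midpoint", "cutoff" or "levels".
    "midpoint" and "cutoff" only make sense for numeric data.
    """
    distinct = sorted(set(values))          # ascending order
    if rule == "last":
        success = {distinct[-1]}
    elif rule == "first":
        success = {distinct[0]}
    elif rule == "midpoint":
        midpoint = (distinct[0] + distinct[-1]) / 2
        return [1 if v > midpoint else 0 for v in values]
    elif rule == "cutoff":
        return [1 if v > cutoff else 0 for v in values]
    elif rule == "levels":
        success = set(levels)
    else:
        raise ValueError("unknown rule: %r" % rule)
    return [1 if v in success else 0 for v in values]


if __name__ == "__main__":
    print(define_success([4, 5, 5, 4]))                     # last value is 5
    print(define_success(["Yes", "No", "Maybe"], "levels",
                         levels=["Yes", "Maybe"]))
```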
One-Sample Bayesian Inference On Poisson Distribution 


Using Bayes-Factor 
Notations 


The following notation, defined in this section, is used in the subsequent sections. 


X: $X = (X_1, X_2, \ldots, X_N)$, a random sample from a Poisson distribution with mean $\lambda$, or $X_i \sim \mathrm{Poisson}(\lambda)$. 
$X_i = 0, 1, 2, \ldots$ takes nonnegative integer values. 


N: The total number of cases (events) in the data set. 


f: $f = (f_1, f_2, \ldots, f_N)$, a frequency or replication weight for X. Non-integer frequency weights are 
rounded to the nearest integer. For values less than 0.5 or missing, the corresponding case will not 
be used. 


$N_f$: $N_f = \sum_{i=1}^{N} f_i$. If no frequencies are present, $N_f = N$. 

Y: $Y = \sum_{i=1}^{N} f_i X_i \sim \mathrm{Poisson}(N_f \lambda)$, where Y is observed by y. Note that Y is a sufficient statistic. 

$\lambda_0$: A population rate parameter under the null hypothesis $H_0$. We assume that $\lambda_0 \sim \mathrm{Gamma}(a_0, b_0)$. 


$\lambda_1$: A population rate parameter under the alternative hypothesis $H_1$. We assume that $\lambda_1 \sim \mathrm{Gamma}(a_1, b_1)$. 


Bayes-Factor Based on Gamma-Poisson Distribution 


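Consider the probability density function of the Gamma prior, defined by 


$$p(\lambda \mid a, b) = \frac{b^{a}}{\Gamma(a)} \lambda^{a-1} e^{-b\lambda}, \qquad (1)$$


where $a, b > 0$. If $b_0$ and $b_1$ are rate parameters, the Bayes factor based on the Gamma-Poisson distribution 
is 


$$\Delta_{01} = \frac{\Pr(Y \mid H_0)}{\Pr(Y \mid H_1)} = \frac{b_0^{a_0} (b_1 + N_f)^{a_1 + y}\, \Gamma(a_0 + y)\, \Gamma(a_1)}{b_1^{a_1} (b_0 + N_f)^{a_0 + y}\, \Gamma(a_1 + y)\, \Gamma(a_0)}, \qquad (2)$$


where $\Gamma$ is the gamma function defined by 

$$\Gamma(k) = \int_0^{\infty} t^{k-1} e^{-t}\, dt. \qquad (3)$$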
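
The Gamma-Poisson Bayes factor involves ratios of gamma functions that overflow quickly, so a practical evaluation usually works on the log scale. The sketch below is an illustration only, assuming the ratio form $b_0^{a_0}\Gamma(a_0+y)/[\Gamma(a_0)(b_0+N_f)^{a_0+y}]$ divided by the same expression for $(a_1, b_1)$; the function name `log_bayes_factor_01` is hypothetical.

```python
import math


def log_bayes_factor_01(y, n_f, a0, b0, a1, b1):
    """Log of the Gamma-Poisson Bayes factor of H0 against H1.

    y is the weighted sum of counts and n_f the weighted number of cases.
    The factor N_f**y / y! is common to both marginal likelihoods and cancels.
    """
    def log_marginal(a, b):
        return (a * math.log(b) + math.lgamma(a + y)
                - math.lgamma(a) - (a + y) * math.log(b + n_f))

    return log_marginal(a0, b0) - log_marginal(a1, b1)


if __name__ == "__main__":
    # Example: 12 events over 5 cases, two competing Gamma priors.
    log_bf = log_bayes_factor_01(y=12, n_f=5, a0=2.0, b0=1.0, a1=4.0, b1=1.0)
    print("Bayes factor:", math.exp(log_bf))
```

Returning the logarithm and exponentiating only at the end keeps the computation stable for large counts.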


Using Conjugate and Reference Priors 
Notations 


The following notations defined in this section will be used for the subsequent sections. 


X: $X = (X_1, X_2, \ldots, X_N)$, a random sample from a Poisson distribution with mean $\lambda$, or $X_i \sim \mathrm{Poisson}(\lambda)$. 
$X_i = 0, 1, 2, \ldots$ takes nonnegative integer values. 


f: $f = (f_1, f_2, \ldots, f_N)$, a frequency or replication weight for X. Non-integer frequency weights are 
rounded to the nearest integer. For values less than 0.5 or missing, the corresponding case will not 
be used. 


$N_f$: $N_f = \sum_{i=1}^{N} f_i$. If no frequencies are present, $N_f = N$. 

Y: $Y = \sum_{i=1}^{N} f_i X_i \sim \mathrm{Poisson}(N_f \lambda)$, where Y is observed by y. Note that Y is a sufficient statistic. 


$\lambda$: A population rate or intensity parameter, with its prior distribution assumed in later discussions. 
Gamma Prior 


In this section, we place a conjugate prior on $\lambda$ by assuming that $\lambda \sim \mathrm{Gamma}(a_0, b_0)$, where $a_0, b_0 > 0$, and 
$b_0$ is the rate parameter. The probability density function of the prior is thus 


$$p(\lambda \mid a_0, b_0) = \frac{b_0^{a_0}}{\Gamma(a_0)} \lambda^{a_0 - 1} e^{-b_0 \lambda}. \qquad (4)$$


Under this setting, the marginal posterior distribution of $\lambda$ is 


• $\lambda \mid Y \sim \mathrm{Gamma}(a_N, b_N)$, 


where $a_N = \sum_{i=1}^{N} f_i x_i + a_0 = y + a_0$, and $b_N = N_f + b_0$. We may find the Bayes estimators of $\lambda$ by computing 
the expected value 


$$E(\lambda \mid Y) = a_N / b_N, \qquad (5)$$


and the variance of the marginal posterior distribution of $\lambda \mid Y$ 


$$V(\lambda \mid Y) = a_N / b_N^2. \qquad (6)$$

We may also find a 100(1 − c)% equal-tailed Bayesian credible interval for $\lambda$, 


$$\lambda \in \left( \mathrm{IdfGam}\left(\tfrac{c}{2},\, a_N,\, b_N\right),\; \mathrm{IdfGam}\left(1 - \tfrac{c}{2},\, a_N,\, b_N\right) \right), \qquad (7)$$


which covers $\lambda$ with posterior probability $1 - c$, where $c = 0.05$ by default. 
Finding the mode of $\lambda \mid Y$ requires a closer look at the parameter support. 


• If $a_N \ge 1$, 


$$\tilde{\lambda} = (a_N - 1)/b_N. \qquad (8)$$


• If $0 < a_N < 1$, the mode does not exist. The density curve has a vertical asymptote at $\lambda = 0$. 
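
A small Python sketch of the Gamma posterior summaries follows. It is illustrative only: the name `gamma_posterior_summary` is hypothetical, the mode is returned for $a_N \ge 1$ (an assumption about the boundary case), and the credible interval is left out because the inverse Gamma distribution function is not available in the standard library. Setting `a0=1, b0=0` or `a0=0.5, b0=0` gives the two limiting prior choices that can be written in this parameterization.

```python
def gamma_posterior_summary(y, n_f, a0, b0):
    """Mean, variance and mode of lambda | Y ~ Gamma(a_N, b_N).

    a_N = y + a0 and b_N = n_f + b0, with b_N used as a rate parameter.
    """
    a_n = y + a0
    b_n = n_f + b0
    mean = a_n / b_n
    variance = a_n / b_n ** 2
    # For 0 < a_N < 1 the density is unbounded near zero and no mode exists.
    mode = (a_n - 1) / b_n if a_n >= 1 else None
    return {"a_N": a_n, "b_N": b_n, "mean": mean,
            "variance": variance, "mode": mode}


if __name__ == "__main__":
    # 7 events in 4 weighted cases with a0 = 1, b0 = 0.
    print(gamma_posterior_summary(y=7, n_f=4, a0=1.0, b0=0.0))
```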


Uniform Prior 


In this section, we place a reference prior on $\lambda$ by assuming that $\lambda \sim \mathrm{Uniform}(0,1)$. Actually this prior 
follows a special case as discussed in the “Gamma Prior” section with $a_0 = 1$ and $b_0 = 0$. Under this setting, 
the marginal posterior distribution of $\lambda$ is 


• $\lambda \mid Y \sim \mathrm{Gamma}(a_N, b_N)$, 


where $a_N = \sum_{i=1}^{N} f_i x_i + 1 = y + 1$, and $b_N = N_f$. We may find the Bayes estimators of $\lambda$ by computing the 
expected value 


$$E(\lambda \mid Y) = a_N / b_N = (y + 1)/N_f, \qquad (9)$$


and the variance of the marginal posterior distribution of $\lambda \mid Y$ 


$$V(\lambda \mid Y) = a_N / b_N^2 = (y + 1)/N_f^2. \qquad (10)$$

We may also find a 100(1 − c)% equal-tailed Bayesian credible interval for $\lambda$, 

$$\lambda \in \left( \mathrm{IdfGam}\left(\tfrac{c}{2},\, y + 1,\, N_f\right),\; \mathrm{IdfGam}\left(1 - \tfrac{c}{2},\, y + 1,\, N_f\right) \right), \qquad (11)$$


which covers $\lambda$ with posterior probability $1 - c$, where $c = 0.05$ by default. 
Similarly, the mode of $\lambda \mid Y$ depends on the parameter support. 


• If $a_N \ge 1$, 

$$\tilde{\lambda} = (a_N - 1)/b_N = y/N_f. \qquad (12)$$


• If $0 < a_N < 1$, the mode does not exist. The density curve has a vertical asymptote at $\lambda = 0$. 
Jeffreys Prior 


In this section, we place the Jeffreys prior on $\lambda$ by assuming that $p(\lambda) \propto \lambda^{-1/2}$. Actually this prior follows 
a special case as discussed in the “Gamma Prior” section with $a_0 = 1/2$ and $b_0 = 0$. Under this setting, the 
marginal posterior distribution of $\lambda$ is 


• $\lambda \mid Y \sim \mathrm{Gamma}(a_N, b_N)$, 


where $a_N = \sum_{i=1}^{N} f_i x_i + 1/2 = y + 1/2$, and $b_N = N_f$. We may find the Bayes estimators of $\lambda$ by computing 
the expected value 


$$E(\lambda \mid Y) = a_N / b_N = (y + 1/2)/N_f, \qquad (13)$$


and the variance of the marginal posterior distribution of $\lambda \mid Y$ 


$$V(\lambda \mid Y) = a_N / b_N^2 = (y + 1/2)/N_f^2. \qquad (14)$$


We may also find a 100(1 − c)% equal-tailed Bayesian credible interval for $\lambda$, 


$$\lambda \in \left( \mathrm{IdfGam}\left(\tfrac{c}{2},\, y + 1/2,\, N_f\right),\; \mathrm{IdfGam}\left(1 - \tfrac{c}{2},\, y + 1/2,\, N_f\right) \right), \qquad (15)$$


which covers $\lambda$ with posterior probability $1 - c$, where $c = 0.05$ by default. 


Similarly, the mode of $\lambda \mid Y$ depends on the parameter support. 


• If $a_N \ge 1$, 


$$\tilde{\lambda} = (a_N - 1)/b_N = (y - 1/2)/N_f. \qquad (16)$$


• If $0 < a_N < 1$, the mode does not exist. The density curve has a vertical asymptote at $\lambda = 0$. 
BAYES REGRESSION Algorithms 


Bayesian Inference on Multiple Linear Regression Models 


Bayesian Inference on the Linear Regression Models 


Let $q_i$ denote the regression weight for the $i$-th case in the $n$ observations. If no regression weight is 
specified, $q_i = 1$. If $q_i \le 0$ or missing, the corresponding case is not used. Let $f_i$ denote the frequency 
weight for the $i$-th case in the $n$ observations. A non-integer $f_i$ is rounded to the nearest integer. For 
$f_i < 0.5$ or missing, the corresponding case will not be used. We further define $w_i = q_i f_i$, and 
$W = \mathrm{diag}(w_1, w_2, \ldots, w_n) = \mathrm{diag}(q_1 f_1, q_2 f_2, \ldots, q_n f_n)$, or 


$$W = \begin{pmatrix} q_1 f_1 & 0 & \cdots & 0 \\ 0 & q_2 f_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & q_n f_n \end{pmatrix}. \qquad (1)$$


Note that the effective sample size is $N = \sum_{i=1}^{n} f_i$, and $N = n$ if no frequency weights are present. 


Using Bayes Factor 
Zellner’s Method 


Zellner once suggested a $g$ prior, broadly discussed under $M_1$ [Zellner, 1986]: 
• $p(\alpha, \phi \mid M_1) \propto 1/\phi$. 
• $\beta \mid (\phi, g, M_1) \sim \mathrm{Normal}\left(0, \frac{g}{\phi}(X^T W X)^{-1}\right)$, where $g$ is fixed. 


Since $g$ is fixed, Zellner’s $g$ prior is computationally efficient. Under these settings, the Bayes factor 
suggested by Zellner between $M_1$ and $M_0$ has the closed form 


$$\Delta_{10}(g) = (1 + g)^{(N - 1 - p)/2} \left[1 + g(1 - R^2)\right]^{-(N-1)/2}, \qquad (2)$$


where $g > 0$ is fixed and preset, and $R^2$ is the unadjusted proportion of variance accounted for by 
the covariates, which can be computed in the same way as by the REGRESSION algorithm. 
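
As an illustration of how a fixed-$g$ Bayes factor can be evaluated from $R^2$ alone, the sketch below uses the commonly cited closed form $(1+g)^{(N-1-p)/2}[1+g(1-R^2)]^{-(N-1)/2}$. Treat that expression as an assumption of this example rather than a statement of the procedure's exact computation; the function name `zellner_log_bf10` is hypothetical.

```python
import math


def zellner_log_bf10(r2, n, p, g):
    """Log Bayes factor of M1 (p predictors) against the null model M0.

    r2 is the unadjusted R-squared of M1, n the effective sample size,
    and g > 0 the fixed prior scale.
    """
    if not (0.0 <= r2 <= 1.0) or g <= 0:
        raise ValueError("need 0 <= r2 <= 1 and g > 0")
    return (0.5 * (n - 1 - p) * math.log1p(g)
            - 0.5 * (n - 1) * math.log1p(g * (1.0 - r2)))


if __name__ == "__main__":
    # Unit-information style choice g = n (an arbitrary choice for the demo).
    print(math.exp(zellner_log_bf10(r2=0.40, n=50, p=3, g=50.0)))
```

Working with logarithms avoids overflow when $N$ is large; exponentiate only when the Bayes factor itself is needed.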


Jeffreys-Zellner-Siow’s (JZS) Method 
The Bayes factor suggested by Zellner and Siow between $M_1$ and $M_0$ is 


$$\Delta_{10} = \int_0^{\infty} (1 + g)^{(N - 1 - p)/2} \left[1 + g(1 - R^2)\right]^{-(N-1)/2} \frac{\sqrt{N/2}}{\Gamma(1/2)}\, g^{-3/2} e^{-N/(2g)}\, dg, \qquad (3)$$


where $\Gamma(1/2) = \sqrt{\pi}$, and $R^2$ is the unadjusted proportion of variance accounted for by the covariates, which 
can be computed in the same way as by the REGRESSION algorithm. 

Hyper-g Method 

The Bayes factor suggested by Liang et al. between $M_1$ and $M_0$ is 


$$\Delta_{10}(a) = \frac{a - 2}{2} \int_0^{\infty} (1 + g)^{(N - 1 - p - a)/2} \left[1 + g(1 - R^2)\right]^{-(N-1)/2}\, dg, \qquad (4)$$


where $a$ is preset, and $R^2$ is the unadjusted proportion of variance accounted for by the covariates, which can be 
computed in the same way as by the REGRESSION algorithm. 
Rouder’s Method 

The Bayes factor suggested by Rouder and Morey between $M_1$ and $M_0$ is 

$$\Delta_{10}(s) = \int_0^{\infty} (1 + g)^{(N - 1 - p)/2} \left[1 + g(1 - R^2)\right]^{-(N-1)/2} \frac{s\sqrt{N/2}}{\Gamma(1/2)}\, g^{-3/2} e^{-N s^2/(2g)}\, dg, \qquad (6)$$


where $s$ ($s > 0$) is specified by users, $\Gamma(1/2) = \sqrt{\pi}$, and $R^2$ is the unadjusted proportion of variance accounted 
for by the predictors, which can be computed in the same way as by the REGRESSION algorithm. 


Full model based Bayes factors 


Although most of the approaches are based on the comparison of $M_1$ with the null model $M_0$, it is still 
necessary and tractable to derive a full model based Bayes factor. The full model can be expressed by 


$$M_F: y = 1_n \alpha + X\beta + Z\gamma + \epsilon. \qquad (7)$$


The null hypothesis we desire to test is $H_0: \gamma = 0$. To construct the Bayes factor based on the user-defined 
full model, we may follow procedures similar to those in the previous sections. Thus, the 
Bayes factors between $M_1$ and $M_F$ by the different methods are 


1 = R2 (N—p—1)/2 
Zellner: Azp(g) =(1+g)78-P-Y/? i +9 ( = 7) | (8) 
1 
ioe) 1 R2 (N—p-1)/2 N/2 
7S: s _ 1 —(N-P-1)/2 |4 F -3/2,-N/(29) ) gq 
(9) 
=) foe) 1 2 (N—p—1)/2 
Hyper-g: Ah,(a) =" | (beg) ern? heey = dg , (10) 
2 0 1 ~ Ry 
me 1— RA\) A? VP fs, /N/2 : 
; r = 1 —(N—P-1)/2 |4 F —3/26-Ns"/(29) | q 
Rouder ip(s) i (1+4q) +g 1- Re? T(/2) g e€ g , 
(11) 


where $R^2$ and $R_F^2$ are the unadjusted proportions of variance accounted for by the covariates of the models 
$M_1$ and $M_F$. The integrals in Equations (8)-(11) can be numerically approximated by feeding in the 
correct input $f(g)$. 


Characterizing Posterior Distributions 


In this section, we still consider Model $M_1$ represented by Equation (??). In the following discussions, we 
define $\theta^T = (\alpha, \beta^T)$, and 


$$A = (1_n \;\; X) = \begin{pmatrix} 1 & x_{11} & x_{21} & \cdots & x_{p1} \\ 1 & x_{12} & x_{22} & \cdots & x_{p2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{1n} & x_{2n} & \cdots & x_{pn} \end{pmatrix}_{n \times (p+1)}. \qquad (12)$$


Note that the columns of $A$ must be linearly independent, and $\mathrm{rank}(A) = p + 1$. In practice, we can relax 
this restriction. In the following presentation, we let $(A^T W A)^{-1}$ denote the generalized inverse of $A^T W A$, 
and do not assume that $A^T W A$ is nonsingular. 

Recall from the conventional statistical analysis of multiple linear regression models that the unbiased 
estimates of the regression parameters are 


$$\hat{\theta} = (A^T W A)^{-1} A^T W y, \qquad (13)$$


and the estimated variance of the error terms is 


$$\hat{\sigma}^2 = \frac{1}{N - p - 1} (y - A\hat{\theta})^T W (y - A\hat{\theta}). \qquad (14)$$
Using Conjugate Priors 

We place a conjugate prior by assuming that 
• $\sigma^2 \sim \text{Inverse-Gamma}(a_0, b_0)$, 
• $\theta \mid \sigma^2 \sim \mathrm{Normal}(\theta_0, \sigma^2 V_0)$. 


Note that $V_0$ must be positive definite and specified with the correct dimension. Otherwise, the output 
gives a warning message and uses the identity matrix $I_{p+1}$ to continue the analysis. If redundant 
columns are identified in the design matrix, we do not estimate them but assign 0 to the estimated 
coefficients. For Bayesian prediction, we zero out the corresponding elements in $\theta_0$ and the corresponding columns and 
rows in $V_0$. 


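Regression parameters $\theta$: Under this setting, the resulting marginal posterior distribution $\theta \mid X, y$ follows 
a scaled multivariate $t$ distribution with $\nu$ degrees of freedom, where $\nu = 2a_0 + N$. 
Before finding the Bayes estimator of $\theta$, we define the following quantities: 


$$\theta_1 = (V_0^{-1} + A^T W A)^{-1} (V_0^{-1}\theta_0 + A^T W y), \qquad (15)$$

$$V_1 = (V_0^{-1} + A^T W A)^{-1}, \qquad (16)$$

$$a_1 = a_0 + \frac{N}{2}, \qquad (17)$$

$$b_1 = b_0 + \frac{1}{2}\left(\theta_0^T V_0^{-1}\theta_0 + y^T W y - \theta_1^T (V_0^{-1} + A^T W A)\,\theta_1\right). \qquad (18)$$


Hence, assuming that $\nu > 4$, we compute the mode 


$$\tilde{\theta} = \theta_1 = (V_0^{-1} + A^T W A)^{-1} (V_0^{-1}\theta_0 + A^T W y), \qquad (19)$$

the expected value 

$$E(\theta \mid X, y) = \theta_1 = (V_0^{-1} + A^T W A)^{-1} (V_0^{-1}\theta_0 + A^T W y), \qquad (20)$$

and the variance-covariance matrix 

$$\mathrm{Cov}(\theta \mid X, y) = \frac{b_1}{a_1 - 1} V_1, \qquad (21)$$


where $V_1$, $a_1$, and $b_1$ are defined by Equations (16)-(18), and the diagonal elements are the variances of the 
elements in $\theta = (\alpha, \beta_1, \beta_2, \ldots, \beta_p)^T$. Define 


$$B^* = \begin{pmatrix} B^*_{11} & B^*_{12} & \cdots & B^*_{1,p+1} \\ B^*_{21} & B^*_{22} & \cdots & B^*_{2,p+1} \\ \vdots & \vdots & \ddots & \vdots \\ B^*_{p+1,1} & B^*_{p+1,2} & \cdots & B^*_{p+1,p+1} \end{pmatrix} = \frac{b_1}{a_1} V_1. \qquad (22)$$


We may also find 100(1 − c)% equal-tailed Bayesian credible intervals for $\alpha$ and $\beta_i$, 

$$\alpha \in \left(\tilde{\theta}_1 - \mathrm{IdfT}\left(1 - \tfrac{c}{2}, \nu\right)\sqrt{B^*_{11}},\; \tilde{\theta}_1 + \mathrm{IdfT}\left(1 - \tfrac{c}{2}, \nu\right)\sqrt{B^*_{11}}\right), \text{ and} \qquad (23)$$


$$\beta_i \in \left(\tilde{\theta}_{i+1} - \mathrm{IdfT}\left(1 - \tfrac{c}{2}, \nu\right)\sqrt{B^*_{i+1,i+1}},\; \tilde{\theta}_{i+1} + \mathrm{IdfT}\left(1 - \tfrac{c}{2}, \nu\right)\sqrt{B^*_{i+1,i+1}}\right), \qquad (24)$$


each covering its parameter with posterior probability $1 - c$, where $c = 0.05$ by default, $i = 1, 2, \ldots, p$, $\tilde{\theta}_i$ is the $i$-th element in $\tilde{\theta}$, and $B^*_{i+1,i+1}$ 
is the $(i+1)$-th element on the diagonal of $B^*$. Note that we only need to evaluate the diagonal elements 
of $B^*$. 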




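Error variance $\sigma^2$: Under the setting of conjugate priors, the marginal posterior distribution of $\sigma^2$ is 

$$\sigma^2 \mid X, y \sim \text{Inverse-Gamma}(a_1, b_1), \qquad (25)$$


where $a_1$ and $b_1$ are defined by Equations (17) and (18), respectively. 
We may find the Bayes estimators of $\sigma^2$ by computing the mode 


$$\tilde{\sigma}^2 = \frac{b_1}{a_1 + 1}, \qquad (26)$$

the expected value 

$$E(\sigma^2 \mid X, y) = \frac{b_1}{a_1 - 1}, \qquad (27)$$

for $a_1 > 1$, and the variance of the marginal posterior distribution of $\sigma^2 \mid X, y$ 

$$V(\sigma^2 \mid X, y) = \frac{b_1^2}{(a_1 - 1)^2 (a_1 - 2)}, \qquad (28)$$

for $a_1 > 2$. We may also find a 100(1 − c)% equal-tailed Bayesian credible interval for $\sigma^2$, 


$$\sigma^2 \in \left(1/\mathrm{IdfGam}\left(1 - \tfrac{c}{2},\, a_1,\, b_1\right),\; 1/\mathrm{IdfGam}\left(\tfrac{c}{2},\, a_1,\, b_1\right)\right), \qquad (29)$$


which covers $\sigma^2$ with posterior probability $1 - c$, where $c = 0.05$ by default. 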
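
The point summaries of an Inverse-Gamma posterior are simple enough to check by hand. The sketch below is an illustration only, assuming the standard Inverse-Gamma results (mode $b_1/(a_1+1)$, mean $b_1/(a_1-1)$, variance $b_1^2/[(a_1-1)^2(a_1-2)]$); the function name `inverse_gamma_summary` is hypothetical, and the interval is omitted because it needs an inverse Gamma distribution function.

```python
def inverse_gamma_summary(a1, b1):
    """Mode, mean and variance of sigma^2 ~ Inverse-Gamma(a1, b1).

    The mean requires a1 > 1 and the variance a1 > 2; None is returned
    for a moment that does not exist.
    """
    mode = b1 / (a1 + 1)
    mean = b1 / (a1 - 1) if a1 > 1 else None
    variance = b1 ** 2 / ((a1 - 1) ** 2 * (a1 - 2)) if a1 > 2 else None
    return {"mode": mode, "mean": mean, "variance": variance}


if __name__ == "__main__":
    # a1 = a0 + N/2 and b1 as in the posterior update; values here are made up.
    print(inverse_gamma_summary(a1=3.0, b1=4.0))
```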


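Predicted Value $\tilde{y}$: Suppose we have observed a new $m \times (p + 1)$ matrix of regressors $\tilde{A}$, and we are 
interested in predicting the corresponding outcome $\tilde{y}$. Given $\theta$ and $\sigma^2$, it follows that 


$$\tilde{y} \mid \tilde{A}, \theta, \sigma^2 \sim \mathrm{Normal}(\tilde{A}\theta, \sigma^2 I_m). \qquad (30)$$

The marginal posterior distribution of $\tilde{y}$ is 

$$\tilde{y} \mid \tilde{A}, y \sim t_{2a_1}\left(\tilde{A}\theta_1,\; \frac{b_1}{a_1}\left(I_m + \tilde{A} V_1 \tilde{A}^T\right)\right), \qquad (31)$$


where $a_1$, $b_1$, and $V_1$ are defined in Equations (17), (18), and (16), respectively. We may find the Bayes 
estimators of $\tilde{y}$ by computing the mode 


$$\hat{y} = \tilde{A}\theta_1, \qquad (32)$$

the expected value 

$$E(\tilde{y} \mid \tilde{A}, y) = \tilde{A}\theta_1, \qquad (33)$$

and the variance-covariance matrix 

$$\mathrm{Cov}(\tilde{y} \mid \tilde{A}, y) = \frac{b_1}{a_1 - 1}\left(I_m + \tilde{A} V_1 \tilde{A}^T\right). \qquad (34)$$

Define 

$$D^* = \begin{pmatrix} D^*_{11} & D^*_{12} & \cdots & D^*_{1m} \\ D^*_{21} & D^*_{22} & \cdots & D^*_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ D^*_{m1} & D^*_{m2} & \cdots & D^*_{mm} \end{pmatrix} = \frac{b_1}{a_1}\left(I_m + \tilde{A} V_1 \tilde{A}^T\right). \qquad (35)$$

We may also find 100(1 − c)% equal-tailed Bayesian credible intervals for $\tilde{y} = (\tilde{y}_1, \tilde{y}_2, \ldots, \tilde{y}_m)$ such 
that 

$$\tilde{y}_i \in \left(\hat{y}_i - \mathrm{IdfT}\left(1 - \tfrac{c}{2}, 2a_1\right)\sqrt{D^*_{ii}},\; \hat{y}_i + \mathrm{IdfT}\left(1 - \tfrac{c}{2}, 2a_1\right)\sqrt{D^*_{ii}}\right), \qquad (36)$$


each covering $\tilde{y}_i$ with posterior probability $1 - c$, where $c = 0.05$ by default, $i = 1, 2, \ldots, m$, $\hat{y}_i$ is the $i$-th element in $\hat{y}$, and $D^*_{ii}$ is 
the $i$-th element on the diagonal of $D^*$. Note that we only need to evaluate the diagonal elements of $D^*$. 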
A subset of $\theta$: If we desire to make statistical inference on $\theta = c$ by including all the regression parameters, 
we may construct a Bayesian F-statistic by computing 


$$\mathcal{F}(\theta) = \frac{(\theta_1 - c)^T V_1^{-1} (\theta_1 - c)}{2 b_1} \cdot \frac{2 a_1}{p + 1} \sim F_{p+1,\, 2a_1}, \qquad (37)$$


where $\theta_1$, $V_1$, $a_1$, and $b_1$ are defined by Equations (15)-(18), respectively, $p$ denotes the number of 
non-redundant parameters excluding the intercept term, and $c$ is a vector of testing values specified by users 
with its number of elements equal to the number of parameters being estimated. By default, $c = 0$. The 
associated p-value is thus $1 - \mathrm{CDF.F}(\mathcal{F}, p + 1, 2a_1)$, where CDF.F is the IBM® SPSS® Statistics function 
for the cumulative F distribution. 

Furthermore, it is not uncommon to be interested in a subset of $k$ non-redundant 
parameter(s) in $\theta$, where $1 \le k \le p + 1$. Note that the redundant parameter(s) specified by users, if any, will 
be removed before the F-statistic is estimated. Let $\theta'$ denote such $k$ parameter(s) to be tested. To make 
inference on $\theta'$, we rewrite the null hypothesis as $H_0: L\theta = c$ by constructing an appropriate $L_{k \times (p+1)}$ 
matrix such that its element on the $i$-th row and the $i'$-th column is equal to 1 with the remaining elements equal 
to 0, where $i$ ($1 \le i \le k$) and $i'$ ($1 \le i' \le p + 1$) are the position indices of the parameter(s) in $\theta'$ and $\theta$, 
respectively. For instance, if $\theta' = (\beta_1, \beta_3)^T$, the $L$ matrix would be 


$$L = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 & \cdots & 0 \\ 0 & 0 & 0 & 1 & 0 & \cdots & 0 \end{pmatrix}_{2 \times (p+1)}. \qquad (38)$$


The F-statistic can be formulated by 


$$\mathcal{F}(\theta') = \frac{(L\theta_1 - c)^T \left[L V_1 L^T\right]^{-1} (L\theta_1 - c)}{2 b_1} \cdot \frac{2 a_1}{k} \sim F_{k,\, 2a_1}. \qquad (39)$$


The associated p-value is thus $1 - \mathrm{CDF.F}(\mathcal{F}, k, 2a_1)$, where CDF.F is the IBM® SPSS® Statistics function 
for the cumulative F distribution. Note that Equation (37) is a special case of Equation (39) when $L = I_{(p+1) \times (p+1)}$. 


Using Standard Reference Priors 


By setting $V_0^{-1} \to 0$, $a_0 = -(p + 1)/2$, and $b_0 = 0$, it turns out that we place a reference prior by assuming 
that 

$$p(\theta, \sigma^2) \propto 1/\sigma^2. \qquad (40)$$


Regression parameters $\theta$: Under the setting of Equation (40), the resulting marginal posterior distribution 
$\theta \mid X, y$ follows a scaled multivariate $t$ distribution with $\nu = N - (p + 1)$ degrees of freedom. We can 
also find the Bayes estimators of $\theta$, assuming that $A^T W A$ is nonsingular, by computing the mode 


$$\tilde{\theta} = (A^T W A)^{-1} A^T W y, \qquad (41)$$

the expected value 

$$E(\theta \mid X, y) = (A^T W A)^{-1} A^T W y, \qquad (42)$$

and the variance-covariance matrix 

$$\mathrm{Cov}(\theta \mid X, y) = \frac{\nu}{\nu - 2}\, s^2 (A^T W A)^{-1}, \qquad (43)$$

where 

$$s^2 = \frac{1}{\nu}(y - A\hat{\theta})^T W (y - A\hat{\theta}) = \frac{1}{\nu}\left[y^T W y - y^T W A (A^T W A)^{-1} A^T W y\right], \qquad (44)$$
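and the diagonal elements are the variances of the elements in $\theta = (\alpha, \beta_1, \beta_2, \ldots, \beta_p)^T$. Define 

$$A^* = \begin{pmatrix} A^*_{11} & A^*_{12} & \cdots & A^*_{1,p+1} \\ A^*_{21} & A^*_{22} & \cdots & A^*_{2,p+1} \\ \vdots & \vdots & \ddots & \vdots \\ A^*_{p+1,1} & A^*_{p+1,2} & \cdots & A^*_{p+1,p+1} \end{pmatrix} = s^2 (A^T W A)^{-1}. \qquad (45)$$


We may also find 100(1 − c)% equal-tailed Bayesian credible intervals for $\alpha$ and $\beta_i$, 

$$\alpha \in \left(\hat{\theta}_1 - \mathrm{IdfT}\left(1 - \tfrac{c}{2}, \nu\right)\sqrt{A^*_{11}},\; \hat{\theta}_1 + \mathrm{IdfT}\left(1 - \tfrac{c}{2}, \nu\right)\sqrt{A^*_{11}}\right), \text{ and} \qquad (46)$$


$$\beta_i \in \left(\hat{\theta}_{i+1} - \mathrm{IdfT}\left(1 - \tfrac{c}{2}, \nu\right)\sqrt{A^*_{i+1,i+1}},\; \hat{\theta}_{i+1} + \mathrm{IdfT}\left(1 - \tfrac{c}{2}, \nu\right)\sqrt{A^*_{i+1,i+1}}\right), \qquad (47)$$


each covering its parameter with posterior probability $1 - c$, where $c = 0.05$ by default, $i = 1, 2, \ldots, p$, $\hat{\theta}_i$ is the $i$-th element in $\hat{\theta}$, and $A^*_{i+1,i+1}$ 
is the $(i+1)$-th element on the diagonal of $A^*$. Note that we only need to evaluate the diagonal elements 
of $A^*$. 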

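Error Variance $\sigma^2$: Under the prior setting of (40), the marginal posterior distribution of $\sigma^2$ is 

$$\sigma^2 \mid X, y \sim \text{Inverse-}\chi^2(\nu, s^2) = \text{Inverse-Gamma}(\nu/2,\; \nu s^2/2), \qquad (48)$$


where $\nu = N - (p + 1)$, and $s^2$ is defined by Equation (44). 
We may find the Bayes estimators of $\sigma^2$ by computing the mode 


$$\tilde{\sigma}^2 = \frac{\nu}{\nu + 2}\, s^2, \qquad (49)$$

the expected value 


$$E(\sigma^2 \mid X, y) = \frac{\nu}{\nu - 2}\, s^2, \qquad (50)$$


for $\nu > 2$, and the variance of the marginal posterior distribution of $\sigma^2 \mid X, y$ 


$$V(\sigma^2 \mid X, y) = \frac{2\nu^2 s^4}{(\nu - 2)^2 (\nu - 4)}, \qquad (51)$$


for $\nu > 4$. We may also find a 100(1 − c)% equal-tailed Bayesian credible interval for $\sigma^2$, 

$$\sigma^2 \in \left(1/\mathrm{IdfGam}\left(1 - \tfrac{c}{2},\, \tfrac{\nu}{2},\, \tfrac{\nu}{2} s^2\right),\; 1/\mathrm{IdfGam}\left(\tfrac{c}{2},\, \tfrac{\nu}{2},\, \tfrac{\nu}{2} s^2\right)\right), \qquad (52)$$

which covers $\sigma^2$ with posterior probability $1 - c$, where $c = 0.05$ by default. 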


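Predicted Value $\tilde{y}$: Suppose we have observed a new $m \times (p + 1)$ matrix of regressors $\tilde{A}$, and we are 
interested in predicting the corresponding outcome $\tilde{y}$. Given $\theta$ and $\sigma^2$, it follows that 


$$\tilde{y} \mid \tilde{A}, \theta, \sigma^2 \sim \mathrm{Normal}(\tilde{A}\theta, \sigma^2 I_m). \qquad (53)$$

The marginal posterior distribution of $\tilde{y}$ is 

$$\tilde{y} \mid \tilde{A}, y \sim t_{N-(p+1)}\left(\tilde{A}(A^T W A)^{-1} A^T W y,\; s^2\left(I_m + \tilde{A}(A^T W A)^{-1}\tilde{A}^T\right)\right), \qquad (54)$$

where $s^2$ is defined by Equation (44). We may find the Bayes estimators of $\tilde{y}$ by computing the mode 

$$\hat{y} = \tilde{A}(A^T W A)^{-1} A^T W y = \tilde{A}\hat{\theta}, \qquad (55)$$


the expected value 


$$E(\tilde{y} \mid \tilde{A}, y) = \tilde{A}(A^T W A)^{-1} A^T W y = \tilde{A}\hat{\theta}, \qquad (56)$$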
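and the variance-covariance matrix 


$$\mathrm{Cov}(\tilde{y} \mid \tilde{A}, y) = \frac{\nu}{\nu - 2}\, s^2\left(I_m + \tilde{A}(A^T W A)^{-1}\tilde{A}^T\right), \qquad (57)$$

where $\nu = N - (p + 1)$. Define 

$$F^* = \begin{pmatrix} F^*_{11} & F^*_{12} & \cdots & F^*_{1m} \\ F^*_{21} & F^*_{22} & \cdots & F^*_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ F^*_{m1} & F^*_{m2} & \cdots & F^*_{mm} \end{pmatrix} = s^2\left(I_m + \tilde{A}(A^T W A)^{-1}\tilde{A}^T\right). \qquad (58)$$


We may also find 100(1 − c)% equal-tailed Bayesian credible intervals for $\tilde{y} = (\tilde{y}_1, \tilde{y}_2, \ldots, \tilde{y}_m)$ such 
that 


$$\tilde{y}_i \in \left(\hat{y}_i - \mathrm{IdfT}\left(1 - \tfrac{c}{2}, \nu\right)\sqrt{F^*_{ii}},\; \hat{y}_i + \mathrm{IdfT}\left(1 - \tfrac{c}{2}, \nu\right)\sqrt{F^*_{ii}}\right), \qquad (59)$$


each covering $\tilde{y}_i$ with posterior probability $1 - c$, where $c = 0.05$ by default, $i = 1, 2, \ldots, m$, $\hat{y}_i$ is the $i$-th element in $\hat{y}$, and $F^*_{ii}$ is 
the $i$-th element on the diagonal of $F^*$. Note that we only need to evaluate the diagonal elements of $F^*$. 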


A subset of $\theta$: If we desire to make statistical inference on $\theta = c$ by including all the regression parameters, 
we may construct a Bayesian F-statistic by computing 


$$\mathcal{F}(\theta) = \frac{(\hat{\theta} - c)^T A^T W A\, (\hat{\theta} - c)}{y^T W y - \hat{\theta}^T A^T W A\, \hat{\theta}} \cdot \frac{\nu}{p + 1} \sim F_{p+1,\, \nu}, \qquad (60)$$


where $\nu = N - (p + 1)$, $\hat{\theta}$ is defined by Equation (13), $p$ denotes the number of non-redundant parameters 
excluding the intercept term, and $c$ is a vector of testing values specified by users with its number of elements 
equal to the number of parameters being estimated. By default, $c = 0$. The associated p-value is thus 
$1 - \mathrm{CDF.F}(\mathcal{F}, p + 1, \nu)$, where CDF.F is the IBM® SPSS® Statistics function for the cumulative F distribution. 

Furthermore, it is not uncommon to be interested in a subset of $k$ non-redundant 
parameter(s) in $\theta$, where $1 \le k \le p + 1$. Note that the redundant parameter(s) specified by users, if any, will 
be removed before the F-statistic is estimated. Let $\theta'$ denote such $k$ parameter(s) to be tested. To make 
inference on $\theta'$, we rewrite the null hypothesis as $H_0: L\theta = c$ by constructing an appropriate $L_{k \times (p+1)}$ 
matrix such that its element on the $i$-th row and the $i'$-th column is equal to 1 with the remaining elements equal 
to 0, where $i$ ($1 \le i \le k$) and $i'$ ($1 \le i' \le p + 1$) are the position indices of the parameter(s) in $\theta'$ and $\theta$, 
respectively. For instance, if $\theta' = (\beta_1, \beta_3)^T$, the $L$ matrix would be 


$$L = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 & \cdots & 0 \\ 0 & 0 & 0 & 1 & 0 & \cdots & 0 \end{pmatrix}_{2 \times (p+1)}. \qquad (61)$$


The F-statistic can be formulated by 


$$\mathcal{F}(\theta') = \frac{(L\hat{\theta} - c)^T \left[L (A^T W A)^{-1} L^T\right]^{-1} (L\hat{\theta} - c)}{y^T W y - \hat{\theta}^T A^T W A\, \hat{\theta}} \cdot \frac{\nu}{k} \sim F_{k,\, \nu}. \qquad (62)$$


The associated p-value is thus $1 - \mathrm{CDF.F}(\mathcal{F}, k, \nu)$, where CDF.F is the IBM® SPSS® Statistics function 
for the cumulative F distribution. Note that Equation (60) is a special case of Equation (62) when $L = I_{(p+1) \times (p+1)}$. 


References 


[George and Foster, 2000] George, E. and Foster, D. P. (2000). Calibration and empirical Bayes variable 
selection. Biometrika, 87(4):731-747. 


[Liang et al., 2012] Liang, F., Paulo, R., Molina, G., Clyde, M. A., and Berger, J. O. (2012). Mixtures of g 
priors for Bayesian variable selection. Journal of the American Statistical Association. 


[Rouder and Morey, 2012] Rouder, J. N. and Morey, R. D. (2012). Default Bayes factors for model selection 
in regression. Multivariate Behavioral Research, 47(6):877-903. 


[Sartori, 2003] Sartori, N. (2003). A note on likelihood asymptotics in normal linear regression. Annals of 
the Institute of Statistical Mathematics, 55(1):187-195. 


[Zellner, 1986] Zellner, A. (1986). On assessing prior distributions and Bayesian regression analysis with 
g-prior distributions. Bayesian Inference and Decision Techniques: Essays in Honor of Bruno De Finetti, 
6:233-243. 


[Zellner and Siow, 1980] Zellner, A. and Siow, A. (1980). Posterior odds ratios for selected regression 
hypotheses. Trabajos de estadística y de investigación operativa, 31(1):585-603. 


Bootstrapping Algorithms 


Bootstrapping is a method for deriving robust estimates of standard errors and confidence 
intervals for estimates such as the mean, median, proportion, odds ratio, correlation coefficient 
or regression coefficient. It may also be used for constructing hypothesis tests. Bootstrapping 
is most useful as an alternative to parametric estimates when the assumptions of those methods 
are in doubt (as in the case of regression models with heteroscedastic residuals fit to small 
samples), or where parametric inference is impossible or requires very complicated formulas 
for the calculation of standard errors (as in the case of computing confidence intervals for the 
median, quartiles, and other percentiles). 


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


Table 11-1 
Notation 

Notation                                      Description 
$K$                                           Number of distinct records in the dataset. 
$x_k$                                         The kth distinct record, k=1,...,K. 
$f_k$                                         Frequency weight of the kth record. 
$N$                                           Number of records, $N = \sum_{k=1}^{K} f_k$. 
$B$                                           Number of bootstrap samples. 
$f^*_{bk}$                                    Generated frequency weight for the kth record of the bth bootstrap sample. 
$T$                                           Statistic to bootstrap. 
$T^*_b$                                       The bth bootstrap copy of statistic T. 
$T^*_{(1)} \le \cdots \le T^*_{(B)}$          Ordered bootstrap values. 


Sampling 


The following sampling methods are available. 


Jackknife Sampling 


Jackknife sampling is used in combination with bootstrap sampling to approximate influence 
functions that are used in computing BCa confidence intervals. The algorithm is performed by 
leaving out one record at a time, and outputs the following frequency weights: 


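$$\begin{pmatrix} f_1 - 1 & f_1 & \cdots & f_1 \\ f_2 & f_2 - 1 & \cdots & f_2 \\ \vdots & \vdots & \ddots & \vdots \\ f_K & f_K & \cdots & f_K - 1 \end{pmatrix}$$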
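
A minimal Python sketch of the leave-one-out weights is shown below. It assumes that each jackknife sample lowers exactly one record's frequency weight by 1 and leaves the others unchanged; the function name `jackknife_weights` is hypothetical.

```python
def jackknife_weights(freq):
    """Return one list of frequency weights per jackknife sample.

    Sample j is the original weight vector with the weight of record j
    reduced by 1 (one case of that record is left out).
    """
    samples = []
    for j in range(len(freq)):
        weights = list(freq)
        weights[j] -= 1
        samples.append(weights)
    return samples


if __name__ == "__main__":
    for weights in jackknife_weights([3, 5, 2]):
        print(weights)
```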
Case Resampling 


In the context of bootstrapping, case resampling means to randomly sample with replacement 
from the original dataset. This creates bootstrap samples of equal size to the original dataset. The 
algorithm is performed iteratively over k=1,...,K and b=1,...,B to generate frequency weights: 


$$f^*_{bk} = \begin{cases} \mathrm{rv.binom}\left(N,\; \dfrac{f_1}{N}\right) & k = 1 \\[2ex] \mathrm{rv.binom}\left(N - \sum_{i=1}^{k-1} f^*_{bi},\; \dfrac{f_k}{N - \sum_{i=1}^{k-1} f_i}\right) & \text{otherwise} \end{cases}$$
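
The sequential binomial scheme can be sketched as follows. This is an illustration under assumptions: the conditional success probability for record k is taken to be $f_k$ divided by the frequency mass not yet visited, the binomial draw is simulated with Bernoulli trials so that only the standard library is needed, and the names `binomial_draw` and `case_resample_weights` are hypothetical.

```python
import random


def binomial_draw(n, p, rng):
    """Draw from Binomial(n, p) as a sum of n Bernoulli(p) trials."""
    return sum(1 for _ in range(n) if rng.random() < p)


def case_resample_weights(freq, rng):
    """Generate bootstrap frequency weights for one bootstrap sample.

    Records are visited in order; each receives a binomial share of the
    draws that are still unassigned, so the weights always sum to N.
    """
    remaining_draws = sum(freq)   # N minus the weights generated so far
    remaining_mass = sum(freq)    # N minus the original weights visited so far
    weights = []
    for f_k in freq:
        p = f_k / remaining_mass if remaining_mass > 0 else 0.0
        draw = binomial_draw(remaining_draws, p, rng)
        weights.append(draw)
        remaining_draws -= draw
        remaining_mass -= f_k
    return weights


if __name__ == "__main__":
    rng = random.Random(2024)
    print(case_resample_weights([3, 5, 2], rng))
```

For the last record the conditional probability is 1, so it absorbs all remaining draws and the bootstrap sample has the same size as the original data.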
Stratified Sampling 


When subpopulations vary considerably, it is advantageous to sample each subpopulation 
(stratum) independently. Stratification is the process of grouping members of the population into 
relatively homogeneous subgroups before sampling. The strata should be mutually exclusive: 
every element in the population must be assigned to only one stratum. The strata should also be 
collectively exhaustive: no population element can be excluded. Then simple case resampling is 
applied within each stratum to generate frequency weights $f^*_{bk}$. 


Residual Sampling 


Residual sampling supports bootstrapping of regression models. In this case, the predicted 
variable for each record will be adjusted with a residual that is randomly sampled in the residual 
set with replacement. This adjusted variable will be used as the dependent variable in the new 
bootstrap sample. Residual sampling assumes homoscedastic residuals. 


The following notation applies to residual sampling: 


Table 11-2 
Notation 

Notation                     Description 
$(x_k, y_k)$                 Data pairs used to build regression models. 
$\hat{y}_k$                  Predicted values under the fitted model. 
$e_k$                        Residuals, $e_k = y_k - \hat{y}_k$. 
$(x^*_{bi}, y^*_{bi})$       Data pairs for the bth bootstrap sample. 


For i=1,...,N, the algorithm sets: 


$$x^*_{bi} = x_{k(i)}$$


where k(i) maps i to k based upon $f_k$; that is, if $f_1 = 3$ and $f_2 = 5$, then k(1)=...=k(3)=1, k(4)=...=k(8)=2, 
and so on. 

For i=1,...,N and b=1,...,B, the algorithm sets: 

$$y^*_{bi} = \hat{y}_{k(i)} + e \times \mathrm{rv.multinomial}(f_1, \ldots, f_K)$$


where $e$ is the 1×K matrix of residuals and rv.multinomial($f_1, \ldots, f_K$) produces a K×1 matrix 
representing a single draw from a multinomial distribution with relative frequencies $f_1, \ldots, f_K$. 


Wild Bootstrap Sampling 


Wild bootstrap is similar to residual sampling, but the sign of the bootstrap residual for each 
record is randomly reversed. Wild bootstrap is useful in the presence of heteroscedastic residuals 
and small sample sizes. 


For i=1,...,N, the algorithm sets: 


$$x^*_{bi} = x_{k(i)}$$


where k(i) maps i to k based upon $f_k$; that is, if $f_1 = 3$ and $f_2 = 5$, then k(1)=...=k(3)=1, k(4)=...=k(8)=2, 
and so on. 


For i=1,...,N and b=1,...,B, the algorithm sets: 


$$y^*_{bi} = \hat{y}_{k(i)} + \left(1 - 2\,\mathrm{rv.bernoulli}(0.5)\right)\left(e \times \mathrm{rv.multinomial}(f_1, \ldots, f_K)\right)$$

where $e$ is the 1×K matrix of residuals and rv.multinomial($f_1, \ldots, f_K$) produces a K×1 matrix 
representing a single draw from a multinomial distribution with relative frequencies $f_1, \ldots, f_K$. 
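
The residual and wild bootstrap schemes differ only in the random sign applied to the resampled residual, which the following Python sketch makes explicit. It is an illustration under assumptions: a single multinomial draw is interpreted as picking one residual with probability proportional to its frequency weight, and the function name `bootstrap_responses` and the `wild` flag are hypothetical.

```python
import random


def bootstrap_responses(y_hat, resid, freq, rng, wild=False):
    """Generate the N = sum(freq) bootstrap responses of one sample.

    y_hat, resid and freq hold the fitted value, residual and frequency
    weight of each distinct record k.
    """
    responses = []
    for k, f_k in enumerate(freq):
        for _ in range(f_k):            # record k is expanded f_k times
            # One multinomial draw: choose a residual with probability f_k / N.
            e_star = rng.choices(resid, weights=freq, k=1)[0]
            if wild:
                # 1 - 2 * Bernoulli(0.5) equals +1 or -1 with equal probability.
                bernoulli = 1 if rng.random() < 0.5 else 0
                e_star *= 1 - 2 * bernoulli
            responses.append(y_hat[k] + e_star)
    return responses


if __name__ == "__main__":
    rng = random.Random(7)
    print(bootstrap_responses([1.0, 2.0], [0.5, -0.25], [3, 5], rng))
    print(bootstrap_responses([1.0, 2.0], [0.5, -0.25], [3, 5], rng, wild=True))
```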
Pooling 


The following pooling methods are available: bootstrap estimates and percentile-t pivotal tests. 
Bootstrap Estimates 


Bias 


The bias of statistic T can be estimated by the following equation 

$$\mathrm{Bias}(T) = B^{-1} \sum_{b=1}^{B} T^*_b - T$$


Standard error 


The standard error of statistic T can be estimated by the standard deviation of the bootstrap values 
with the following equation 

$$SE = \sqrt{\frac{1}{B - 1} \sum_{b=1}^{B} \left(T^*_b - B^{-1} \sum_{b=1}^{B} T^*_b\right)^2}$$
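
Both pooled estimates are one-liners once the bootstrap values are available. The sketch below is illustrative; the function name `bootstrap_bias_se` is hypothetical, and the divisor $B - 1$ in the standard deviation is an assumption of this example.

```python
import math


def bootstrap_bias_se(t_observed, t_boot):
    """Bootstrap bias and standard error of a scalar statistic.

    t_observed is the statistic on the original data and t_boot the list
    of B bootstrap copies (B must be at least 2).
    """
    b = len(t_boot)
    mean_boot = sum(t_boot) / b
    bias = mean_boot - t_observed
    # Sample standard deviation of the bootstrap values (divisor B - 1).
    se = math.sqrt(sum((t - mean_boot) ** 2 for t in t_boot) / (b - 1))
    return bias, se


if __name__ == "__main__":
    bias, se = bootstrap_bias_se(2.0, [1.0, 2.0, 3.0, 4.0])
    print("bias =", bias, "se =", se)
```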
Percentile confidence interval 


Suppose that T estimates a scalar $\theta$, that we want an interval with left- and right-tail errors both 
equal to $\alpha$, and that bootstrap values are ordered as $T^*_{(1)} \le \cdots \le T^*_{(B)}$. The basic percentile 
confidence interval is 


$$\hat{\theta}_{\alpha} = T^*_{((B+1)\alpha)}, \qquad \hat{\theta}_{1-\alpha} = T^*_{((B+1)(1-\alpha))}$$


If $(B + 1)\alpha$ is not an integer, then interpolation can be used. A simple method that works well for 
approximately normal estimators is linear interpolation on the normal quantile scale. For example, 
suppose the integer part of $(B + 1)\alpha$ is k; then we define 


$$T^*_{((B+1)\alpha)} = T^*_{(k)} + \frac{\Phi^{-1}(\alpha) - \Phi^{-1}\left(\frac{k}{B+1}\right)}{\Phi^{-1}\left(\frac{k+1}{B+1}\right) - \Phi^{-1}\left(\frac{k}{B+1}\right)} \left(T^*_{(k+1)} - T^*_{(k)}\right)$$


where $\Phi^{-1}(\cdot)$ is the inverse normal(0,1) distribution. Similarly, if $(B + 1)(1 - \alpha)$ is not an 
integer, the same interpolation can be used by replacing $\alpha$ with $1 - \alpha$ in the equation above. 
Clearly such interpolations fail if k=0, B or B+1. If this happens, we quote the extreme value and 
the implied level of error equal to 1/(B + 1). 
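
The interpolation rule can be sketched with the standard library's normal quantile function. This is an illustration only: the function name `percentile_endpoint` is hypothetical, a small tolerance is used to decide whether $(B+1)\alpha$ is an integer, and the fallback to the extreme order statistic covers the cases in which interpolation is not possible.

```python
import math
from statistics import NormalDist


def percentile_endpoint(sorted_boot, alpha):
    """Return T*_((B+1)alpha) from ascending bootstrap values.

    Uses linear interpolation on the normal quantile scale when
    (B + 1) * alpha is not an integer.
    """
    b = len(sorted_boot)
    position = (b + 1) * alpha
    nearest = round(position)
    is_integer = abs(position - nearest) < 1e-9
    k = nearest if is_integer else math.floor(position)

    if k < 1:
        return sorted_boot[0]       # extreme value, interpolation impossible
    if k > b or (k == b and not is_integer):
        return sorted_boot[-1]      # extreme value, interpolation impossible
    if is_integer:
        return sorted_boot[k - 1]   # the kth ordered value itself

    inv = NormalDist().inv_cdf
    lower = inv(k / (b + 1))
    upper = inv((k + 1) / (b + 1))
    weight = (inv(alpha) - lower) / (upper - lower)
    return sorted_boot[k - 1] + weight * (sorted_boot[k] - sorted_boot[k - 1])


if __name__ == "__main__":
    boot = sorted(float(v) for v in range(1, 100))   # B = 99 toy values
    print(percentile_endpoint(boot, 0.025), percentile_endpoint(boot, 0.975))
```

Calling the function with `alpha` and `1 - alpha` gives the two endpoints of the percentile interval.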


BCa confidence interval 


The influence value of the kth record in the sth stratum is approximated by 


$$l_{jack,sk} = (N_s - 1)\left(T - T_{-sk}\right)$$


where $T_{-sk}$ is the estimate calculated from the original data but with the frequency $f_{sk} - 1$ for 
the kth record in the sth stratum. It is reasonable to assume the empirical influence values 
$l_{sk} = l_{jack,sk}$. 
Defining lee. = Usk. N/N,_, the BCa confidence interval is given a 
Baxi T(B+1)a)? Ao=T, ((B+1)(1-&)) 
where 
a= (w + an Cee :) : 
lq = ae mC: hy 
HE sth <t) 
Boi 
1 Lenser i: kh 
Interpolation will be used as in the Percentile confidence interval. 


Percentile-t Pivotal Tests 


Suppose the null hypothesis is $H_0: T = T_0$. 

Scalar T 


Let $z_0 = (T - T_0)/SE$ and $z^*_b = (T^*_b - T)/SE^*_b$, where $SE$ and $SE^*_b$ are the standard errors of $T$ 
and $T^*_b$, respectively. We estimate the standard error from the standard errors calculated within 
the procedure. 


The alternative hypothesis can be $H_A: T > T_0$, $H_A: T < T_0$, or $H_A: T \ne T_0$, which correspond 
to right-sided, left-sided, and two-sided p-values, respectively. The bootstrap right-sided p-value 
is calculated as 


$$p = \frac{\#\{z^*_b \ge z_0\} + 1}{B + 1}$$

The bootstrap left-sided p-value is calculated as 

$$p = \frac{\#\{z^*_b \le z_0\} + 1}{B + 1}$$

The bootstrap two-sided p-value is calculated as 

$$p = \frac{\#\{z^{*2}_b \ge z_0^2\} + 1}{B + 1}$$

Vector T 


Let $z_0 = (T - T_0)^T \mathrm{Cov}(T)^{-1} (T - T_0)$ and $z^*_b = (T^*_b - T)^T \mathrm{Cov}(T^*_b)^{-1} (T^*_b - T)$, where 
$\mathrm{Cov}(T)$ and $\mathrm{Cov}(T^*_b)$ are the covariance matrices of $T$ and $T^*_b$, respectively. We estimate the 
covariance matrix from the covariance matrix calculated within the procedure. 


The alternative hypothesis is $H_A: T \ne T_0$, and the bootstrap p-value can be calculated as 


$$p = \frac{\#\{z^*_b \ge z_0\} + 1}{B + 1}$$


The percentile-t pivotal tests can also support bootstrap testing for the null 
hypothesis $H_0: LT = T_0$, where $L$ is a matrix of linear combinations. 
In this case, let $z_0 = (LT - T_0)^T \{L\,\mathrm{Cov}(T)\,L^T\}^{-1} (LT - T_0)$ and 
$z^*_b = (LT^*_b - LT)^T \{L\,\mathrm{Cov}(T^*_b)\,L^T\}^{-1} (LT^*_b - LT)$. The alternative hypothesis is $H_A: LT \ne T_0$, 
and the bootstrap p-value can be calculated as 


$$p = \frac{\#\{z^*_b \ge z_0\} + 1}{B + 1}$$
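
Once the studentized values are available, the three scalar p-values are simple counts. The following Python sketch is an illustration only; the function name `percentile_t_p_values` is hypothetical, and the inputs `z0` and `z_boot` are assumed to have been computed beforehand as $(T - T_0)/SE$ and $(T^*_b - T)/SE^*_b$.

```python
def percentile_t_p_values(z0, z_boot):
    """Right-sided, left-sided and two-sided bootstrap p-values.

    z0 is the studentized statistic on the original data and z_boot the
    list of B studentized bootstrap values.
    """
    b = len(z_boot)
    right = (sum(1 for z in z_boot if z >= z0) + 1) / (b + 1)
    left = (sum(1 for z in z_boot if z <= z0) + 1) / (b + 1)
    two_sided = (sum(1 for z in z_boot if z * z >= z0 * z0) + 1) / (b + 1)
    return {"right": right, "left": left, "two_sided": two_sided}


if __name__ == "__main__":
    print(percentile_t_p_values(1.5, [-2.0, -1.0, 0.0, 1.0, 2.0]))
```

The added 1 in the numerator and denominator counts the observed statistic itself, so the p-value can never be exactly zero.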
CATPCA Algorithms 


The CATPCA procedure quantifies categorical variables using optimal scaling, resulting in 
optimal principal components for the transformed variables. The variables can be given mixed 
optimal scaling levels and no distributional assumptions about the variables are made. 


In CATPCA, dimensions correspond to components (that is, an analysis with two dimensions 
results in two components), and object scores correspond to component scores. 


Notation 

The following notation is used throughout this chapter unless otherwise stated: 

Table 12-1 
Notation 

Notation      Description 
$n$           Number of analysis cases (objects) 
$n_w$         Weighted number of analysis cases: $n_w = \sum_{i=1}^{n_{tot}} w_i$ 
$n_{tot}$     Total number of cases (analysis + supplementary) 
$w_i$         Weight of object $i$; $w_i = 1$ if cases are unweighted; $w_i = 0$ if object $i$ is 
              supplementary. 
$W$           Diagonal $n_{tot} \times n_{tot}$ matrix, with $w_i$ on the diagonal. 
$m$           Number of analysis variables 
$m_w$         Weighted number of analysis variables ($m_w = \sum_{j=1}^{m} v_j$) 
$m_{tot}$     Total number of variables (analysis + supplementary) 
$m_1$         Number of analysis variables with multiple nominal scaling level. 
$m_2$         Number of analysis variables with non-multiple scaling level. 
$m_{w1}$      Weighted number of analysis variables with multiple nominal scaling level. 
$m_{w2}$      Weighted number of analysis variables with non-multiple scaling level. 
$J$           Index set recording which variables have multiple nominal scaling level. 
$H$           The data matrix (category indicators), of order $n_{tot} \times m_{tot}$, after 
              discretization, imputation of missings, and listwise deletion, if applicable. 
$p$           Number of dimensions 

For variable $j$; $j = 1, \ldots, m_{tot}$ 

Table 12-2 
Notation 

Notation      Description 
$v_j$         Variable weight; $v_j = 1$ if weight for variable $j$ is not specified or if variable 
              $j$ is supplementary 
$k_j$         Number of categories of variable $j$ (number of distinct values in $h_j$, thus, 
              including supplementary objects) 
$G_j$         Indicator matrix for variable $j$, of order $n_{tot} \times k_j$ 
The elements of $G_j$ are defined as ($i = 1, \ldots, n_{tot}$; $r = 1, \ldots, k_j$) 

$$g_{(j)ir} = \begin{cases} 1 & \text{when the } i\text{th object is in the } r\text{th category of variable } j \\ 0 & \text{when the } i\text{th object is not in the } r\text{th category of variable } j \end{cases}$$

Table 12-3 
Notation 

Notation      Description 
$D_j$         Diagonal $k_j \times k_j$ matrix, containing the weighted univariate marginals; i.e., 
              the weighted column sums of $G_j$ ($D_j = G_j' W G_j$) 
$M_j$         Diagonal $n_{tot} \times n_{tot}$ matrix, with diagonal elements defined as 

$$m_{(j)ii} = \begin{cases} 0 & \text{when the } i\text{th observation is missing and the missing strategy for variable } j \text{ is passive} \\ 0 & \text{when the } i\text{th object is in the } r\text{th category of variable } j \text{ and the } r\text{th category is only used by supplementary objects (i.e., when } d_{(j)rr} = 0) \\ v_j & \text{otherwise} \end{cases}$$
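
As an illustration of the definitions of $G_j$ and $D_j$, the following Python sketch builds a dense indicator matrix and the weighted marginals for one variable. The helper name and the toy data are hypothetical; the procedure itself stores these matrices in reduced form rather than densely.

```python
import numpy as np

def indicator_matrix(h):
    """Return the sorted categories of h and the dense 0/1 indicator matrix."""
    cats, codes = np.unique(h, return_inverse=True)
    g = np.zeros((len(h), len(cats)))
    g[np.arange(len(h)), codes] = 1.0
    return cats, g

# Toy variable with three categories; the last object is supplementary (w = 0)
h_j = np.array([2, 1, 3, 2, 1])
w = np.array([1.0, 2.0, 1.0, 1.0, 0.0])

cats, g_j = indicator_matrix(h_j)
d_j = g_j.T @ np.diag(w) @ g_j      # D_j = G_j' W G_j, a diagonal matrix
marginals = np.diag(d_j)            # weighted frequency of each category
```

Because each row of $G_j$ contains exactly one 1, the product $G_j' W G_j$ is diagonal, and its diagonal is simply the sum of the case weights within each category.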

Table 12-4 
Notation 

Notation      Description 
$M_*$         $\sum_j M_j$ 
$S_j$         I-spline basis for variable $j$, of order $k_j \times (s_j + t_j)$ (see Ramsay (1988) 
              for details) 
$b_j$         Spline coefficient vector, of order $s_j + t_j$ 
$d_j$         Spline intercept. 
$s_j$         Degree of polynomial 
$t_j$         Number of interior knots 


The quantification matrices and parameter vectors are: 


Table 12-5 
Notation 

Notation      Description 
$X$           Object scores, of order $n_{tot} \times p$ 
$X_w$         Weighted object scores ($X_w = WX$) 
$X^n$         $X$ normalized according to requested normalization option 
$Y_j$         Centroid coordinates, of order $k_j \times p$. For variables with optimal 
              scaling level multiple nominal, these are the category quantifications 
$y_j$         Category quantifications for variables with non-multiple scaling level, of 
              order $k_j$ 
$a_j$         Component loadings for variables with non-multiple scaling level, of order $p$ 
$a_j^n$       $a_j$ normalized according to requested normalization option 
$Y$           Collection of category quantifications (centroid coordinates) for variables 
              with multiple nominal scaling level ($Y_j$), and vector coordinates for 
              non-multiple scaling level ($y_j a_j'$). 

Note: The matrices $W$, $G_j$, $M_j$, $M_*$, and $D_j$ are exclusively notational devices; they are 
stored in reduced form, and the program fully profits from their sparseness by replacing matrix 
multiplications with selective accumulation. 


Discretization 


Discretization is done on the unweighted data. 


Multiplying 


First, the original variable is standardized. Then the standardized values are multiplied by 10 and 
rounded, and a value is added such that the lowest value is 1. 
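
A minimal Python sketch of the multiplying method follows. The text does not say which standard deviation is used for standardization; the sample standard deviation (divisor n - 1) is an assumption here, as is the function name. Note also that `np.rint` rounds exact halves to the nearest even integer, which may differ from the procedure's rounding rule.

```python
import numpy as np

def discretize_multiplying(x):
    """Standardize, multiply by 10, round, and shift so the minimum is 1."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std(ddof=1)   # assumption: sample standard deviation
    scaled = np.rint(z * 10)
    return (scaled - scaled.min() + 1).astype(int)

# Toy data: z-scores are (-1, 0, 1), so the scaled values are (-10, 0, 10)
categories = discretize_multiplying([1.0, 2.0, 3.0])
```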


Ranking 


The original variable is ranked in ascending order, according to the alphanumerical value. 


Grouping into a specified number of categories with a normal distribution 


First, the original variable is standardized. Then cases are assigned to categories using intervals 
as defined in Max (1960). 


Grouping into a specified number of categories with a uniform distribution 


First the target frequency is computed as $n$ divided by the number of specified categories, rounded. 
Then the original categories are assigned to grouped categories such that the frequencies of the 
grouped categories are as close to the target frequency as possible. 


Grouping equal intervals of specified size 


First the intervals are defined as lowest value + interval size, lowest value + 2*interval size, etc. 
Then cases with values in the kth interval are assigned to category k. 
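
Equal-interval grouping can be sketched in Python as follows. The text does not state how values that fall exactly on an interval boundary are handled; this sketch assumes half-open intervals that include their lower bound, and the function name is hypothetical.

```python
import numpy as np

def group_equal_intervals(x, size):
    """Assign category k to values in [low + (k-1)*size, low + k*size)."""
    x = np.asarray(x, dtype=float)
    low = x.min()
    return (np.floor((x - low) / size) + 1).astype(int)

# Toy data with interval size 1: boundaries at 1.0, 2.0, 3.0
groups = group_equal_intervals([0.0, 0.4, 1.0, 2.5], size=1.0)
```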


Imputation of Missing Values 


When there are variables with missing values specified to be treated as active (impute mode or 
extra category), then first the $k_j$’s for these variables are computed before listwise deletion. Next 
the category indicator with the highest weighted frequency (mode; the smallest if multiple modes 
exist), or $k_j + 1$ (extra category) is imputed. Then listwise deletion is applied if applicable, and 
the $k_j$’s are adjusted. 
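
The two active imputation strategies can be illustrated with the Python sketch below. The function names, the boolean `missing` mask, and the toy data are hypothetical; the extra-category helper additionally assumes that the observed categories are coded 1 through the number of categories.

```python
import numpy as np

def impute_mode(h, w, missing):
    """Replace missing entries by the weighted mode (smallest mode on ties)."""
    h = np.asarray(h).copy()
    w = np.asarray(w, dtype=float)
    missing = np.asarray(missing, dtype=bool)
    cats = np.unique(h[~missing])                      # sorted ascending
    freq = np.array([w[(h == c) & ~missing].sum() for c in cats])
    mode = cats[np.argmax(freq)]                       # first maximum = smallest mode
    h[missing] = mode
    return h, mode

def impute_extra_category(h, missing):
    """Replace missing entries by category k + 1 (assumes codes 1..k)."""
    h = np.asarray(h).copy()
    missing = np.asarray(missing, dtype=bool)
    k = len(np.unique(h[~missing]))
    h[missing] = k + 1
    return h

# Toy data: categories 1 and 2 are tied, so the smaller one is the mode
h_raw = [1, 2, 2, 1, 0]
mask = [False, False, False, False, True]
h_mode, mode = impute_mode(h_raw, [1, 1, 1, 1, 1], mask)
h_extra = impute_extra_category(h_raw, mask)
```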


If an extra category is imputed for a variable with optimal scaling level Spline Nominal, Spline 
Ordinal, Ordinal or Numerical, the extra category is not included in the restriction according to 
the scaling level in the final phase. 


For more information, see the topic “Objective Function Optimization”. 
Configuration 


CATPCA can read a configuration from a file, to be used as the initial configuration or as a 
fixed configuration in which to fit variables. 


For an initial configuration, see step 1 in “Objective Function Optimization”. 


A fixed configuration $X$ is centered and orthonormalized as described in the optimization 
section in step 3 (with $X$ instead of $Z$) and step 4 (except for the factor $n_w^{1/2}$), and the result is 
postmultiplied with $\Lambda^{1/2}$ (this leaves the configuration unchanged if it is already centered and 
orthogonal). The analysis variables are set to supplementary and variable weights are set to one. 
Then CATPCA proceeds as described in “Supplementary Variables”. 


Objective Function 


The CATPCA objective is to find object scores $X$ and a set of $\underline{Y}_j$ (for $j = 1, \ldots, m$), where 
the underlining indicates that they may be restricted in various ways, so that the function 

$$\sigma(X; \underline{Y}) = n_w^{-1} \sum_j c^{-1}\, \mathrm{tr}\left( (X - G_j \underline{Y}_j)' M_j W (X - G_j \underline{Y}_j) \right),$$

where $c$ is $p$ if $j \in J$ and $c$ is 1 if $j \notin J$, 
is minimal, under the normalization restriction $X' M_* W X = n_w m_w I$ ($I$ is the $p \times p$ identity 
matrix). The inclusion of $M_j$ in $\sigma(X; \underline{Y})$ ensures that there is no influence of passive missing 
values (missing values in variables that have missing option passive, or missing option not 
specified). $M_*$ contains the number of active data values for each object. The object scores are 
also centered; that is, they satisfy $u' M_* W X = 0$ with $u$ denoting an $n$-vector with ones. 
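
The loss can be evaluated directly from its definition, as in the following Python sketch. All names are hypothetical; the diagonal matrices $W$ and $M_j$ are passed as vectors, and dense indicator matrices are used for readability even though the procedure avoids forming them.

```python
import numpy as np

def catpca_loss(x, g_list, y_list, w, m_list, multiple):
    """Evaluate the CATPCA loss for given object scores and quantifications.

    x        : (n_tot, p) object scores
    g_list   : indicator matrices G_j
    y_list   : (k_j, p) quantification matrices (Y_j, or y_j a_j')
    w        : case weights, the diagonal of W
    m_list   : diagonals of the M_j matrices
    multiple : True where variable j has multiple nominal scaling level
    """
    n_w = w.sum()
    p = x.shape[1]
    total = 0.0
    for g, y, m, is_multiple in zip(g_list, y_list, m_list, multiple):
        resid = x - g @ y
        c = p if is_multiple else 1
        # tr(R' M_j W R) equals the weighted sum of squared residual rows
        total += np.sum((m * w)[:, None] * resid**2) / c
    return total / n_w

# Toy check: two objects, one single variable, zero quantifications
x = np.array([[1.0], [-1.0]])
g = np.eye(2)
ones = np.ones(2)
loss_zero_y = catpca_loss(x, [g], [np.zeros((2, 1))], ones, [ones], [False])
loss_perfect = catpca_loss(x, [g], [x.copy()], ones, [ones], [False])
```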


Optimal Scaling Levels 
The following optimal scaling levels are distinguished in CATPCA: 
Multiple Nominal. $\underline{Y}_j = Y_j$ (equality restriction only). 

Nominal. $\underline{Y}_j = y_j a_j'$ (equality and rank-one restrictions). 

Spline Nominal. $\underline{Y}_j = y_j a_j'$ and $y_j = d_j + S_j b_j$ (equality, rank-one, and spline restrictions). 

Spline Ordinal. $\underline{Y}_j = y_j a_j'$ and $y_j = d_j + S_j b_j$ (equality, rank-one, and monotonic spline 
restrictions), with $b_j$ restricted to contain nonnegative elements (to guarantee monotonic I-splines). 

Ordinal. $\underline{Y}_j = y_j a_j'$ and $y_j \in C_j$ (equality, rank-one, and monotonicity restrictions). The 
monotonicity restriction $y_j \in C_j$ means that $y_j$ must be located in the convex cone of all 
$k_j$-vectors with nondecreasing elements. 

Numerical. $\underline{Y}_j = y_j a_j'$ and $y_j \in L_j$ (equality, rank-one, and linearity restrictions). The linearity 
restriction $y_j \in L_j$ means that $y_j$ must be located in the subspace of all $k_j$-vectors that are a linear 
transformation of the vector consisting of $k_j$ successive integers. 

For each variable, these levels can be chosen independently. The general requirement for all 
options is that equal category indicators receive equal quantifications. The general requirement 
for the non-multiple options is $\underline{Y}_j = y_j a_j'$; that is, $\underline{Y}_j$ is of rank one. For identification purposes, 
$y_j$ is always normalized so that $y_j' D_j y_j = n_w$. 


Objective Function Optimization 


Optimization is achieved by executing the following iteration scheme: 
1. Initialization I or II 
2. Update category quantifications 
3. Update object scores 
4. Orthonormalization 
5. Convergence test: repeat (2) through (4) or continue 
6. Rotation and reflection 


The first time (for the initial configuration) initialization I is used, and variables that do not have 
optimal scaling level Multiple Nominal or Numerical are temporarily treated as numerical; 
the second time (for the final configuration) initialization II is used. Steps (1) through (6) are 
explained below. 


Initialization 


I. If an initial configuration is not specified, the object scores $X$ are initialized with 
random numbers. Then $X$ is orthonormalized (see step 4) so that $u' M_* W X = 0$ and 
$X' M_* W X = n_w m_w I$, yielding $X_w^+$. The initial component loadings are computed as the cross 
products of $X_w^+$ and the centered original variables $(I - M_* u u' W / (u' M_* W u))\, h_j$, rescaled 
to unit length. 


II. All relevant quantities are copied from the results of the first cycle. 

Update category quantifications; loop across analysis variables 

With fixed current values $X$, the unconstrained update of $Y_j$ is 

$$\tilde{Y}_j = D_j^{-1} G_j' X_w^+$$

Multiple nominal: $Y_j^+ = \tilde{Y}_j$. 

For non-multiple scaling levels, first an unconstrained update is computed in the same way: 

$$\tilde{Y}_j = D_j^{-1} G_j' X_w^+$$

Next, one cycle of an ALS algorithm (De Leeuw et al., 1976) is executed for computing a rank-one 
decomposition of $\tilde{Y}_j$, with restrictions on the left-hand vector, resulting in 

$$\tilde{y}_j = \tilde{Y}_j a_j$$

Nominal: $y_j^* = \tilde{y}_j$. 


For the next four optimal scaling levels, if variable $j$ was imputed with an extra category, $y_j^*$ 
includes category $k_j$ in the initial phase and excludes category $k_j$ in the final phase. 


Spline nominal and spline ordinal: $y_j^* = d_j + S_j b_j$. 


The spline transformation is computed as a weighted regression (with weights the diagonal 
elements of $D_j$) of $\tilde{y}_j$ on the I-spline basis $S_j$. For the spline ordinal scaling level the elements of 
$b_j$ are restricted to be nonnegative, which makes $y_j^*$ monotonically increasing. 


Ordinal: $y_j^* \leftarrow \mathrm{WMON}(\tilde{y}_j)$. 


The notation WMON( ) is used to denote the weighted monotonic regression process, which 
makes $y_j^*$ monotonically increasing. The weights used are the diagonal elements of $D_j$ and the 
subalgorithm used is the up-and-down-blocks minimum violators algorithm (Kruskal, 1964; 
Barlow et al., 1972). 
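
Weighted monotonic regression can be illustrated with the pool-adjacent-violators scheme shown below. This is a substitute technique chosen for brevity, not the up-and-down-blocks implementation named above; both target the same weighted least squares problem under a nondecreasing constraint. The function name is hypothetical and all weights are assumed to be strictly positive.

```python
def weighted_monotone_regression(y, w):
    """Nondecreasing weighted least squares fit via pool-adjacent-violators."""
    blocks = []  # each block: [weighted mean, total weight, number of items]
    for value, weight in zip(y, w):
        blocks.append([float(value), float(weight), 1])
        # Merge backwards while the last two blocks violate monotonicity
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            v2, w2, c2 = blocks.pop()
            v1, w1, c1 = blocks.pop()
            total = w1 + w2
            blocks.append([(v1 * w1 + v2 * w2) / total, total, c1 + c2])
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted

# Toy data: the pair (3, 2) violates monotonicity and is pooled
fit_equal = weighted_monotone_regression([1, 3, 2, 4], [1, 1, 1, 1])
fit_weighted = weighted_monotone_regression([1, 3, 2, 4], [1, 1, 3, 1])
```

With unequal weights the pooled value moves toward the category with the larger marginal frequency, which is why the diagonal of $D_j$ is used as the weight vector.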


Numerical: $y_j^* \leftarrow \mathrm{WLIN}(\tilde{y}_j)$. 


The notation WLIN( ) is used to denote the weighted linear regression process. The weights 
used are the diagonal elements of $D_j$. 


Next $y_j^*$ is normalized (if variable $j$ was imputed with an extra category, $y_j^*$ includes category 
$k_j$ from here on): 

$$y_j^+ = n_w^{1/2}\, y_j^* \left( y_j^{*\prime} D_j y_j^* \right)^{-1/2}$$

Then we update the component loadings: 

$$a_j^+ = n_w^{-1}\, \tilde{Y}_j' D_j y_j^+$$

Finally, we set $Y_j^+ = y_j^+ a_j^{+\prime}$. 

Update object scores 

First the auxiliary score matrix $Z$ is computed as 

$$Z \leftarrow \sum_j M_j G_j Y_j^+$$

and centered with respect to $W$ and $M_*$: 

$$X^* = \left( I - M_* u u' W / (u' M_* W u) \right) Z$$

These two steps yield locally the best updates if there were no orthogonality constraints. 


Orthonormalization 
To find an $M_*$-orthonormal $X^+$ that is closest to $X^*$ in the least squares sense, 
we use for the Procrustes rotation (Cliff, 1966) the singular value decomposition 
$m_w^{-1/2} M_*^{-1/2} W^{1/2} X^* = K \Lambda^{1/2} L'$; then $n_w^{1/2} m_w^{1/2} M_*^{-1/2} W^{1/2} K L'$ yields $M_*$-orthonormal weighted 
object scores: $X_w^+ \leftarrow n_w^{1/2} m_w M_*^{-1} W X^* L \Lambda^{-1/2} L'$, and $X^+ = W^{-1} X_w^+$. The calculation of $L$ 
and $\Lambda$ is based on tridiagonalization with Householder transformations followed by the implicit 
QL algorithm (Wilkinson, 1965). 
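
The orthonormalization step can be sketched in Python using a library singular value decomposition in place of the Householder and QL routines named above. The function name is hypothetical, the diagonal matrices $W$ and $M_*$ are passed as vectors with strictly positive entries (analysis objects only), and the weighted scores are formed from $K L'$ directly. The final lines check the normalization restriction $X' M_* W X = n_w m_w I$ numerically.

```python
import numpy as np

def orthonormalize(x_star, w, m_star, m_w):
    """Return M_*-orthonormal object scores and the diagonal of Lambda."""
    n_w = w.sum()
    scale = np.sqrt(w / m_star)              # diagonal of M_*^{-1/2} W^{1/2}
    a = scale[:, None] * x_star / np.sqrt(m_w)
    k, sing, lt = np.linalg.svd(a, full_matrices=False)   # a = K diag(sing) L'
    x_w = np.sqrt(n_w * m_w) * scale[:, None] * (k @ lt)  # weighted object scores
    x_plus = x_w / w[:, None]                # X+ = W^{-1} X_w+
    return x_plus, sing**2                   # Lambda^{1/2} are the singular values

# Toy data (hypothetical): six objects, two dimensions
rng = np.random.default_rng(0)
x_star = rng.normal(size=(6, 2))
w = np.array([1.0, 2.0, 1.0, 1.0, 3.0, 1.0])
m_star = np.array([3.0, 3.0, 2.0, 3.0, 3.0, 1.0])
x_plus, lam = orthonormalize(x_star, w, m_star, m_w=3.0)

# Gram matrix X' M_* W X, which should equal n_w * m_w * I
gram = x_plus.T @ (x_plus * (m_star * w)[:, None])
```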


Convergence test 


The difference between consecutive values of the quantity 

$$TFIT = (p\,n_w)^{-1} \sum_{j \in J} v_j\, \mathrm{tr}\left( Y_j' D_j Y_j \right) + \sum_{j \notin J} v_j\, a_j' a_j$$

is compared with the user-specified convergence criterion $\varepsilon$, a small positive number. It can be 
shown that $TFIT = m_{w1} + p\,m_{w2} - \sigma(X; \underline{Y})$. Steps (2) through (4) are repeated as long as the 
loss difference exceeds $\varepsilon$. 


After convergence TFIT is also equal to $\mathrm{tr}(\Lambda^{1/2})$, with $\Lambda$ as computed in the Orthonormalization 
step during the last iteration. (See also “Model Summary” and “Correlations and Eigenvalues” 
for interpretation of $\Lambda^{1/2}$.) 


Rotation and reflection 


To achieve principal axes orientation, X* is rotated with the matrix L. In addition the sth column 
of X* is reflected if for dimension s the mean of squared loadings with a negative sign is higher 
than the mean of squared loadings with a positive sign. Then step (2) is executed, yielding the 
rotated and possibly reflected quantifications and loadings. 


Supplementary Objects 


To compute the object scores for supplementary objects, after convergence the category 
quantifications and object scores are again updated (following the steps in “Objective Function 
Optimization”), with the zeros in $W$ temporarily set to ones in computing $Z$ and $X^*$. If a 
supplementary object has missing values, passive treatment is applied. 


Supplementary Variables 


The quantifications for supplementary variables are computed after convergence. For 
supplementary variables with multiple nominal scaling level, the Update Category Quantification 
step is executed once. For non-multiple supplementary variables, an initial $a_j$ is computed as 
in the Initialization step. Then the rank-one and restriction substeps of the Update Category 
Quantification step are repeated as long as the difference between consecutive values of 
$a_j' a_j$ exceeds .00001, with a maximum of 100 iterations. For more information, see the topic 
“Objective Function Optimization”. 
Diagnostics 


The procedure produces the following diagnostics. 


Maximum Rank (may be issued as a warning when exceeded) 


The maximum rank $p_{max}$ indicates the maximum number of dimensions that can be computed 
for any dataset. In general 

$$p_{max} = \min\left( n - 1,\ \left( \sum_{j \in J} k_j \right) - m_1 + m_2 \right)$$

if there are variables with optimal scaling level multiple nominal without missing values to be 
treated as passive. If variables with optimal scaling level multiple nominal do have missing values 
to be treated as passive, the maximum rank is 

$$p_{max} = \min\left( n - 1,\ \left( \sum_{j \in J} k_j \right) - \max(m_3, 1) + m_2 \right)$$

with $m_3$ the number of variables with optimal scaling level multiple nominal without missing 
values to be treated as passive. 


Here $k_j$ excludes supplementary objects (that is, a category only used by supplementary objects 
is not counted in computing the maximum rank). Although the number of nontrivial dimensions 
may be less than $p_{max}$ when $m = 2$, CATPCA does allow dimensionalities all the way up to $p_{max}$. 
When, due to empty categories in the actual data, the rank deteriorates below the specified 
dimensionality, the program stops. 


Descriptives 


The descriptives table gives the weighted univariate marginals and the weighted number of 
missing values (system missing, user defined missing, and values less than or equal to 0) for 
each variable. 


Fit and Loss Measures 
When the HISTORY option is in effect, the following fit and loss measures are reported: 
Total fit (VAF). This is the quantity TFIT as defined in the Convergence Test step. 
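Total loss. This is $\sigma(X; \underline{Y})$, computed as the sum of multiple loss and single loss defined below. 
Multiple loss. This measure is computed as 

$$TMLOSS = (m_{w1} + p\,m_{w2}) - \left( (n_w p)^{-1} \sum_{j \in J} v_j\, \mathrm{tr}\left( Y_j' D_j Y_j \right) + n_w^{-1} \sum_{j \notin J} v_j\, \mathrm{tr}\left( Y_j' D_j Y_j \right) \right)$$

Single loss. This measure is computed only when some of the variables are single: 

$$SLOSS = n_w^{-1} \sum_{j \notin J} v_j\, \mathrm{tr}\left( Y_j' D_j Y_j \right) - \sum_{j \notin J} v_j\, a_j' a_j$$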
Model Summary 


Model summary information consists of Cronbach’s alpha and the variance accounted for. 


Cronbach’s Alpha 

Cronbach’s Alpha per dimension ($s = 1, \ldots, p$): 

$$\alpha_s = m_w \left( \lambda_s^{1/2} - 1 \right) / \left( \lambda_s^{1/2} (m_w - 1) \right)$$

Total Cronbach’s Alpha is 

$$\alpha = m_w \left( \sum_s \lambda_s^{1/2} - 1 \right) / \left( \sum_s \lambda_s^{1/2}\, (m_w - 1) \right)$$

with $\lambda_s$ the $s$th diagonal element of $\Lambda$ as computed in the Orthonormalization step during the last 
iteration. 
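
Both alpha coefficients follow directly from the eigenvalues, as the Python sketch below illustrates. The function name and toy values are hypothetical; the `eigenvalues` argument holds the per-dimension eigenvalues (the quantities written $\lambda_s^{1/2}$ in this chapter), and `m_w` is the weighted number of analysis variables.

```python
def cronbach_alpha(eigenvalues, m_w):
    """Return (per-dimension alphas, total alpha) from the eigenvalues."""
    per_dim = [m_w * (ev - 1.0) / (ev * (m_w - 1.0)) for ev in eigenvalues]
    total_ev = sum(eigenvalues)
    total = m_w * (total_ev - 1.0) / (total_ev * (m_w - 1.0))
    return per_dim, total

# Toy values: four unit-weight variables, eigenvalues 2 and 1
alpha_per_dim, alpha_total = cronbach_alpha([2.0, 1.0], m_w=4.0)
```

An eigenvalue of exactly 1 gives an alpha of 0 for that dimension, and eigenvalues below 1 give negative values.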
Variance Accounted For 


Variance Accounted For per dimension ($s = 1, \ldots, p$): 


Multiple Nominal variables: 

$$VAF1_s = n_w^{-1} \sum_{j \in J} v_j\, \mathrm{tr}\left( y_{(j)s}' D_j y_{(j)s} \right)$$ (% of variance is $VAF1_s \times 100 / m_{w1}$), 


Non-Multiple variables: 

$$VAF2_s = \sum_{j \notin J} v_j\, a_{js}^2$$ (% of variance is $VAF2_s \times 100 / m_{w2}$). 


Eigenvalue per dimension: 

$$\lambda_s^{1/2} = VAF1_s + VAF2_s$$

with $\lambda_s$ the $s$th diagonal element of $\Lambda$ as computed in the Orthonormalization step during the 
last iteration. (See also the Convergence Test step and “Correlations and Eigenvalues” for 
interpretation of $\Lambda^{1/2}$.) 


The Total Variance Accounted For for multiple nominal variables is the mean over dimensions, 
and for non-multiple variables the sum over dimensions. So, the total eigenvalue is 

$$\mathrm{tr}\left( \Lambda^{1/2} \right) = p^{-1} \sum_s VAF1_s + \sum_s VAF2_s.$$

If there are no passive missing values, the eigenvalues $\Lambda^{1/2}$ are those of the correlation matrix 
(see “Correlations and Eigenvalues”) weighted with variable weights: 

$$r_{jj}^w = v_j r_{jj} \quad \text{and} \quad r_{jl}^w = r_{lj}^w = r_{jl}\, (v_j v_l)^{1/2}$$
If there are passive missing values, then the eigenvalues are those of the matrix $m_w Q_c' M_*^{-1} Q_c$, 
with $Q_c = n_w^{-1/2} \left( I - M_* u u' W / (u' M_* W u) \right) Q$ (see “Correlations and Eigenvalues”), which 
is not necessarily a correlation matrix, although it is positive semi-definite. This matrix is 
weighted with variable weights in the same way as $R$. 


Variance Accounted For 


The Variance Accounted For table gives the VAF per dimension and per variable for centroid 
coordinates, and for non-multiple variables also for vector coordinates (see “Quantifications”). 


Centroid Coordinates 


$$VAF_{js} = v_j\, \mathrm{tr}\left( y_{(j)s}' D_j y_{(j)s} \right)$$


Vector Coordinates 


$$VAF_{js} = v_j\, a_{js}^2 \quad \text{for } j \notin J$$


Correlations and Eigenvalues 
Before Transformation 


$R = n_w^{-1} H_c' W H_c$, with $H_c$ weighted centered and normalized $H$. For the eigenvalue 
decomposition of $R$ (to compute the eigenvalues), first row $j$ and column $j$ are removed from $R$ if $j$ 
is a supplementary variable, and then $r_{ij}$ is multiplied by $(v_i v_j)^{1/2}$. 


If passive missing treatment is applicable for a variable, missing values are imputed with the 
variable mode, regardless of the passive imputation specification. 


After Transformation 


When all analysis variables are non-multiple, and there are no missing values specified to be 
treated as passive, the correlation matrix is: 

$$R = n_w^{-1} Q' W Q, \quad \text{with } q_j = G_j y_j.$$

The first $p$ eigenvalues of $R$ equal $\Lambda^{1/2}$. (See also the Convergence Test step and “Model 
Summary” for interpretation of $\Lambda^{1/2}$.) When there are multiple nominal variables in the analysis, 
$p$ correlation matrices are computed ($s = 1, \ldots, p$): 

$$R_{(s)} = n_w^{-1} Q_{(s)}' W Q_{(s)},$$

with $q_{(s)j} = G_j y_j$ for non-multiple variables and 
$q_{(s)j} = n_w^{1/2}\, G_j y_{(j)s} \left( y_{(j)s}' D_j y_{(j)s} \right)^{-1/2}$ for multiple nominal variables. 

Usually, for the higher eigenvalues, the first eigenvalue of $R_{(s)}$ is equal to $\lambda_s^{1/2}$ (see “Model 
Summary”). The lower values of $\Lambda^{1/2}$ are in most cases the second or subsequent eigenvalues of 
$R_{(s)}$. 


If there are missing values specified to be treated as passive, the mode of the quantified variable 
or the quantification of an extra category (as specified in syntax; if not specified, the default (mode) is 
used) is imputed before computing correlations. Then the eigenvalues of the correlation matrix do 
not equal $\Lambda^{1/2}$ (see the Model Summary section). The quantification of an extra category for multiple 
nominal variables is computed as 

$$y_{(j)k_j+1} = \left( \sum_{i \in I} w_i \right)^{-1} \sum_{i \in I} w_i x_i$$

with $I$ an index set recording which objects have missing values. 


For the quantification of an extra category for non-multiple variables, first $y_{(j)k_j+1}$ is computed 
as above, and then 

$$y_{(k_j+1)j} = \left( \sum_s a_{js}^2 \right)^{-1} \sum_s a_{js}\, y_{(j)(k_j+1)s}$$

For the eigenvalue decomposition of $R$ (to compute the eigenvalues), first row $j$ and column $j$ are 
removed from $R$ if $j$ is a supplementary variable, and then $r_{ij}$ is multiplied by $(v_i v_j)^{1/2}$. 


Object Scores and Loadings 


If all variables have non-multiple scaling level, normalization partitions the first $p$ singular values 
of $n_w^{-1/2} W^{1/2} Q V^{1/2}$ divided by $m_w$ over the object scores $X$ and the loadings $A$, with $Q$ the 
matrix of quantified variables (see “Correlations and Eigenvalues”), and $V$ a diagonal matrix with 
elements $v_j$. The singular value decomposition of $n_w^{-1/2} W^{1/2} Q V^{1/2}$ is 

$$\mathrm{SVD}\left( n_w^{-1/2} W^{1/2} Q V^{1/2} \right) = K \Phi^{1/2} L'.$$

With $X = K_p$ (the subscript $p$ denoting the first $p$ columns of $K$) and $A = (L \Phi^{1/2})_p$, $XA'$ gives 
the best $p$-dimensional approximation of $n_w^{-1/2} W^{1/2} Q V^{1/2}$. 


The first $p$ singular values $\Phi_p^{1/2}$ equal $\Lambda^{1/4}$, with $\Lambda$ as computed in the Orthonormalization 
step during the last iteration. (See also the Convergence Test step and “Model Summary” for 
interpretation of $\Lambda^{1/2}$.) 

For partitioning the first $p$ singular values we write 

$$\left( K \Phi^{1/2} L' \right)_p = K_p \Phi_p^{1/2} L_p' = K_p \Lambda^{a/4} \Lambda^{b/4} L_p' \quad (a + b = 1, \text{ see below}).$$

During the optimization phase, variable principal normalization is used. Then, after convergence, 
$X = n_w^{1/2} W^{-1/2} K_p$ and $A = V^{-1/2} L_p \Lambda^{1/4}$. 

If variable principal normalization is requested, $X^n = X$ and $A^n = A$; otherwise $X^n = X \Lambda^{a/4}$ and 
$A^n = A \Lambda^{(b-1)/4}$ with $a = (1+q)/2$, $b = (1-q)/2$, and $q$ any real value in the closed interval [-1, 1], 
except for independent normalization: then there is no $q$ value and $a = b = 1$. $q = -1$ is equal to 
variable principal normalization, $q = 1$ is equal to object principal normalization, and $q = 0$ is equal to 
symmetrical normalization. 


When there are multiple nominal variables in the analysis, there are $p$ matrices $Q_{(s)}$, $s = 1, \ldots, p$ (see 
“Correlations and Eigenvalues”). Then one of the singular values of $n_w^{-1/2} W^{1/2} Q_{(s)} V^{1/2}$ equals 
$\lambda_s^{1/4}$. 


If a variable has multiple nominal scaling level, the normalization factor is reflected in the 
centroids: $Y_j^n = Y_j \Lambda^{(b-1)/4}$. 


Quantifications 


For variables with non-multiple scaling level the following are displayed: the quantifications $y_j$, the vector 
coordinates $y_j a_j^{n\prime}$, and the centroid coordinates: $Y_j$ with variable principal normalization, 
$D_j^{-1} G_j' W X^n$ with one of the other normalization options. For multiple nominal variables the 
quantifications are the centroid coordinates $Y_j^n$. 


If a category is only used by supplementary objects (i.e., treated as a passive missing), only 
centroid coordinates are displayed for this category, computed as $y_{(j)r} = n_{jr}^{-1} \sum_{i \in I} x_i^n$ for 
variables with non-multiple scaling level and $y_{(j)r} = n_{jr}^{-1} \sum_{i \in I} x_i \Lambda^{(b-1)/4}$ for variables with 
multiple nominal scaling level, where $y_{(j)r}$ is the $r$th row of $Y_j$, $n_{jr}$ is the number of objects that 
have category $r$, and $I$ is an index set recording which objects are in category $r$. 


Residuals 
For non-multiple variables, Residuals gives a plot of the quantified variable $j$ ($G_j y_j$) against the 
approximation, $X a_j$. For multiple nominal variables, plots per dimension are produced of 
$G_j y_{(j)s}$ against the approximation $x_s$. 


Projected Centroids 


The projected centroids of variable $l$ on variable $j$, $j \notin J$, are 

$$Y_l a_j \left( a_j' a_j \right)^{-1/2}$$
Scaling factor Biplot, triplot, and loading plot 


In plots including both the object scores or centroids and loadings (loading plot including 
centroids, biplot with objects and loadings, and triplot with objects, centroids and loadings), the 
object scores and centroids are rescaled using the following scaling factor: 


Pp 


23 max (are, they al) 


s=1 


Scalefactor = 3 


ba |anin (xi, oe gt) + (max (x7, see a) 


s=1 
CATREG (Categorical regression with optimal scaling using alternating least squares) quantifies 
categorical variables using optimal scaling, resulting in an optimal linear regression equation 
for the transformed variables. The variables can be given mixed optimal scaling levels and no 
distributional assumptions about the variables are made. 


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


Notation      Description 
$n$           Number of analysis cases (objects) 
$n_w$         Weighted number of analysis cases: $n_w = \sum_{i=1}^{n_{tot}} w_i$ 
$n_{tot}$     Total number of cases (analysis + supplementary) 
$w_i$         Weight of object $i$; $w_i = 1$ if cases are unweighted; $w_i = 0$ if object $i$ is 
              supplementary. 
$W$           Diagonal $n_{tot} \times n_{tot}$ matrix, with $w_i$ on the diagonal. 
$p$           Number of predictor variables 
$m$           Total number of analysis variables 
$r$           Index of response variable 
$J_p$         Index set of predictor variables 
$H$           The data matrix (category indicators), of order $n_{tot} \times m$, after discretization, 
              imputation of missings, and listwise deletion, if applicable. 
D             Number of dimensions 
$\lambda_1$   Lasso penalty 
$\lambda_2$   Ridge penalty 

For variable $j$; $j = 1, \ldots, m$ 

Notation      Description 
$k_j$         Number of categories of variable $j$ (number of distinct values in $h_j$, thus, 
              including supplementary objects) 
$G_j$         Indicator matrix for variable $j$, of order $n_{tot} \times k_j$ 

The elements of $G_j$ are defined as ($i = 1, \ldots, n_{tot}$; $r = 1, \ldots, k_j$) 

$$g_{(j)ir} = \begin{cases} 1 & \text{when the } i\text{th object is in the } r\text{th category of variable } j \\ 0 & \text{when the } i\text{th object is not in the } r\text{th category of variable } j \end{cases}$$

Notation      Description 
$D_j$         Diagonal $k_j \times k_j$ matrix, containing the weighted univariate marginals; i.e., 
              the weighted column sums of $G_j$ ($D_j = G_j' W G_j$) 
$f$           Vector of degrees of freedom for the predictor variables, of order $p$ 
$S_j$         I-spline basis for variable $j$, of order $k_j \times (s_j + t_j)$ (see Ramsay (1988) 
              for details) 
$b_j$         Spline coefficient vector, of order $s_j + t_j$ 
$d_j$         Spline intercept. 
$s_j$         Degree of polynomial 
$t_j$         Number of interior knots 


The quantification matrices and parameter vectors are: 


Notation      Description 
$y_r$         Category quantifications for the response variable, of order $k_r$ 
$y_j$         Category quantifications for predictor variable $j$, of order $k_j$ 
$b$           Regression coefficients for the predictor variables, of order $p$ 
$v$           Accumulated contributions of predictor variables: $\sum_{j \in J_p} b_j G_j y_j$ 


Note: The matrices $W$, $G_j$, and $D_j$ are exclusively notational devices; they are stored in reduced 
form, and the program fully profits from their sparseness by replacing matrix multiplications 
with selective accumulation. 


Discretization 


Discretization is done on the unweighted data. 


Multiplying 


First, the original variable is standardized. Then the standardized values are multiplied by 10 and 
rounded, and a value is added such that the lowest value is 1. 


Ranking 


The original variable is ranked in ascending order, according to the alphanumerical value. 


Grouping into a specified number of categories with a normal distribution 


First, the original variable is standardized. Then cases are assigned to categories using intervals 
as defined in Max (1960). 


Grouping into a specified number of categories with a uniform distribution 
First the target frequency is computed as $n$ divided by the number of specified categories, rounded. 


Then the original categories are assigned to grouped categories such that the frequencies of the 
grouped categories are as close to the target frequency as possible. 


Grouping equal intervals of specified size 


First the intervals are defined as lowest value + interval size, lowest value + 2*interval size, etc. 
Then cases with values in the kth interval are assigned to category k. 
Imputation of Missing Values 


When there are variables with missing values specified to be treated as active (impute mode or 
extra category), then first the $k_j$’s for these variables are computed before listwise deletion. Next 
the category indicator with the highest weighted frequency (mode; the smallest if multiple modes 
exist), or $k_j + 1$ (extra category) is imputed. Then listwise deletion is applied if applicable, and 
the $k_j$’s are adjusted. 


If an extra category is imputed for a variable with optimal scaling level Spline Nominal, Spline 
Ordinal, Ordinal or Numerical, the extra category is not included in the restriction according to 
the scaling level in the final phase. 


For more information, see the topic “Objective Function Optimization”. 


Objective Function 


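The CATREG objective is to find the set of $y_r$, $b$, and $y_j$, $j \in J_p$, so that the function 

$$\sigma(y_r; b; y_j) = n_w^{-1} \left( G_r y_r - \sum_{j \in J_p} b_j G_j y_j \right)' W \left( G_r y_r - \sum_{j \in J_p} b_j G_j y_j \right)$$

is minimal, under the normalization restriction $y_r' D_r y_r = n_w$. The quantifications of the 
response variable are also centered; that is, they satisfy $u' W G_r y_r = 0$ with $u$ denoting an 
$n$-vector with ones. 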


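With regularization, the loss function is subjected to: 

$$\sum_{j \in J_p} \beta_j^2 \leq t_2 \quad \text{for Ridge,}$$

$$\sum_{j \in J_p} |\beta_j| \leq t_1 \quad \text{for Lasso,}$$

$$\sum_{j \in J_p} |\beta_j| \leq t_1 \ \text{and} \ \sum_{j \in J_p} \beta_j^2 \leq t_2 \quad \text{for Elastic Net.}$$

The constrained loss functions can also be written as penalized loss functions: 

$$L^{ridge} = L + \lambda_2 \sum_{j \in J_p} \beta_j^2$$

$$L^{lasso} = L + \lambda_1 \sum_{j \in J_p} \mathrm{sign}(\beta_j)\, \beta_j$$

$$L^{e\text{-}net} = L + \lambda_2 \sum_{j \in J_p} \beta_j^2 + \lambda_1 \sum_{j \in J_p} \mathrm{sign}(\beta_j)\, \beta_j$$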
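
The penalized forms can be illustrated with one small Python function in which setting either penalty to zero recovers the Ridge or Lasso case, and using both corresponds to the Elastic Net. The function name, argument names, and toy numbers are hypothetical; `loss` stands for the unpenalized value of the objective function.

```python
import numpy as np

def penalized_loss(loss, beta, lam1=0.0, lam2=0.0):
    """Add a Lasso penalty (lam1) and/or a Ridge penalty (lam2) to a loss value."""
    beta = np.asarray(beta, dtype=float)
    ridge_term = lam2 * np.sum(beta**2)
    lasso_term = lam1 * np.sum(np.sign(beta) * beta)   # sign(b) * b equals |b|
    return loss + ridge_term + lasso_term

# Toy coefficients
beta = [0.5, -0.2]
ridge = penalized_loss(0.5, beta, lam2=1.0)
lasso = penalized_loss(0.5, beta, lam1=1.0)
elastic_net = penalized_loss(0.5, beta, lam1=1.0, lam2=1.0)
```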
Optimal Scaling Levels 


The following optimal scaling levels are distinguished in CATREG: 


Nominal. Equality restrictions only. 


Spline Nominal. $y_j = d_j + S_j a_j$ (equality and spline restrictions). 

Spline Ordinal. $y_j = d_j + S_j a_j$ (equality and monotonic spline restrictions), with $a_j$ restricted to 
contain nonnegative elements (to guarantee monotonic I-splines). 


Ordinal. $y_j \in C_j$ (equality and monotonicity restrictions). The monotonicity restriction 
$y_j \in C_j$ means that $y_j$ must be located in the convex cone of all $k_j$-vectors with nondecreasing 
elements. 


Numerical. $y_j \in L_j$ (equality and linearity restrictions). The linearity restriction $y_j \in L_j$ means 
that $y_j$ must be located in the subspace of all $k_j$-vectors that are a linear transformation of the 
vector consisting of $k_j$ successive integers. 


For each variable, these levels can be chosen independently. The general requirement for all 
options is that equal category indicators receive equal quantifications. For identification purposes, 
$y_j$ is always normalized so that $y_j' D_j y_j = n_w$. 
Objective Function Optimization 


Optimization is achieved by executing the following iteration scheme: 
1. Initialization I or II 
2. Update category quantifications response variable 
3. Update category quantifications and regression coefficients predictor variables 
4. Convergence test: repeat (2) through (3) or continue 


Steps (1) through (4) are explained below. 


Initialization 

I. Random 

The initial category quantifications $y_j$ (for $j = 1, \ldots, m$) are defined as the $k_j$ category indicators 
of variable $j$, normalized such that $u' W G_j y_j = 0$ and $y_j' D_j y_j = n_w$, and the initial regression 
coefficients are the correlations with the response variable. 


II. Numerical 


In this case, the iteration scheme is executed twice. In the first cycle, (initialized with initialization 
I) all variables are treated as numerical. The second cycle, with the specified scaling levels, starts 
with the category quantifications and regression coefficients from the first cycle. 
III. Multistart (ALL) 


Choosing all multiple systematic starts guarantees obtaining the global optimal solution when
the spline ordinal or ordinal scaling level is specified for one or more predictors (Van der Kooij,
Meulman, and Heiser, 2006). When this option is chosen, the iteration scheme is executed
$2^s$ times, where s is the number of predictor variables with (spline) ordinal scaling level and $2^s$ is
the number of all possible sign patterns for the regression coefficients of the predictor variables
with (spline) ordinal scaling level. Each execution of the iteration scheme starts with the same
initial category quantifications and regression coefficients (initialized with initialization I), but
with different sign patterns for the coefficients. In the iteration process, the signs are held fixed.
Finally, the iteration scheme is executed one more time using the optimal sign pattern (the pattern
resulting in the highest $R^2$, or $RSQ^{regu}$ if regularization is applied).
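
The enumeration of sign patterns described above can be sketched as follows. This is only an illustration: `fit_with_signs` is a hypothetical callable that stands in for one full run of the iteration scheme with the signs held fixed and returns the fit value ($R^2$, or $RSQ^{regu}$ under regularization).

```python
from itertools import product


def all_sign_patterns(s):
    """Return all 2**s sign patterns for s (spline) ordinal predictors."""
    return [list(pattern) for pattern in product((1, -1), repeat=s)]


def best_sign_pattern(s, fit_with_signs):
    """Pick the pattern with the highest fit.

    fit_with_signs is a hypothetical stand-in for one execution of the
    iteration scheme with fixed signs; it must return a number to maximize.
    """
    return max(all_sign_patterns(s), key=fit_with_signs)
```

With s = 3 the scheme would be executed 8 times before the final run with the winning pattern, so the cost doubles with every additional (spline) ordinal predictor.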


IV. Multistart (value) 


When a threshold value is specified with the multiple systematic starts option, the iteration scheme 
is executed twice for a selection of sign patterns for the regression coefficients of the predictor 
variables with (spline) ordinal scaling level. The sign patterns are selected by a combination of a 
percentage of loss of variance strategy and a hierarchical strategy (Van der Kooij, Meulman, and 
Heiser, 2006). 


The maximum number of sign patterns with this option is $1 + \sum_{i=1}^{s} i$.


In the first cycle (initialized with initialization I), all variables are treated as nominal. The second
cycle, with the specified scaling levels, starts with the category quantifications and regression
coefficients from the first cycle. After one iteration in the second cycle, the decrease in variance
going from the last iteration in the first cycle to the first iteration in the second cycle is determined
for predictors with (spline) ordinal scaling level. If the percentage of decrease for a predictor is
above the specified threshold value, the predictor is allowed to have a negative sign. Then the
second cycle continues a number of times: one time with the regression coefficients for all (spline)
ordinal predictors positive and q times with the regression coefficient for one (spline) ordinal
predictor negative, where q is the number of predictors with (spline) ordinal scaling level that are
allowed to have a negative sign. If the ‘all positive’ sign pattern gives a better result (higher $R^2$, or
$RSQ^{regu}$ if regularization is applied) than the ‘one negative’ sign patterns, the iteration scheme is
executed one more time using the ‘all positive’ sign pattern. Otherwise, if one of the ‘one negative’
sign patterns gives a better result than the ‘all positive’ sign pattern, the best ‘one negative’ sign
pattern is selected and the second cycle is repeated for the ‘two negatives’ sign patterns: the
patterns formed by adding one more negative sign to the best ‘one negative’ sign pattern.

Then, the results of the ‘two negatives’ sign patterns are compared to the ‘one negative’ sign
pattern and the ‘one negative’ sign pattern is selected if its result is better. Otherwise, the second
cycle is repeated for the ‘three negatives’ sign patterns, and so on.
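
The hierarchical part of this strategy can be sketched as a greedy search. The sketch makes several assumptions that are not in the text: `score` is a hypothetical callable that runs the second cycle for a given set of negative signs and returns the fit, `allowed` lists the predictors that passed the threshold test, and a tie is resolved in favor of the pattern with fewer negative signs.

```python
def hierarchical_sign_search(allowed, score):
    """Greedy search over sign patterns, adding one negative sign per level.

    allowed: indices of predictors that may take a negative sign.
    score:   hypothetical callable mapping a frozenset of negative
             indices to the fit value (higher is better).
    """
    best_neg = frozenset()            # the 'all positive' pattern
    best_score = score(best_neg)
    remaining = set(allowed)
    while remaining:
        # Patterns formed by adding one more negative sign to the best so far.
        candidates = [best_neg | {j} for j in remaining]
        candidate = max(candidates, key=score)
        candidate_score = score(candidate)
        if candidate_score <= best_score:
            break                     # the previous level wins; stop descending
        best_neg, best_score = candidate, candidate_score
        remaining -= candidate
    return best_neg, best_score
```

Each level evaluates at most as many patterns as there are remaining candidates, which is why the number of patterns grows as a sum rather than as a power of two.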


V. Fixsigns 


In this case, the iteration scheme is executed twice. In the first cycle (initialized with initialization
I), all variables are treated as nominal. The second cycle, with the specified scaling levels,
starts with the category quantifications and regression coefficients from the first cycle and fixed
signs (read from a user-specified file) for the regression coefficients of the predictor variables
with (spline) ordinal scaling level.


Update category quantifications response variable 

With fixed current values $y_j$, $j \in J_p$, the unconstrained update of $y_r$ is

$$\tilde{y}_r = D_r^{-1} G_r' W v$$

Nominal: $y_r^* = \tilde{y}_r$


For the next four optimal scaling levels, if the response variable was imputed with an extra category,
$y_r^*$ is inclusive category $k_r$ in the initial phase, and is exclusive category $k_r$ in the final phase.


Spline nominal and spline ordinal: $y_r^* = d_r + S_r a_r$


The spline transformation is computed as a weighted regression (with weights the diagonal
elements of $D_r$) of $\tilde{y}_r$ on the I-spline basis $S_r$. For the spline ordinal scaling level the elements of
$a_r$ are restricted to be nonnegative, which makes $y_r^*$ monotonically increasing.


Ordinal: $y_r^* \leftarrow \mathrm{WMON}(\tilde{y}_r)$.


The notation WMON( ) is used to denote the weighted monotonic regression process, which
makes $y_r^*$ monotonically increasing. The weights used are the diagonal elements of $D_r$ and the
subalgorithm used is the up-and-down-blocks minimum violators algorithm (Kruskal, 1964;
Barlow et al., 1972).
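
A minimal sketch of weighted monotonic regression follows. It uses the pool-adjacent-violators scheme instead of the up-and-down-blocks algorithm named above; both are standard ways of solving the same weighted least squares problem under a monotonicity constraint. The function name is illustrative, the inputs are assumed to be ordered by category, and the weights are assumed to be strictly positive.

```python
def weighted_monotone_regression(y, w):
    """Return the nondecreasing sequence closest to y in weighted least squares.

    y: unconstrained values, ordered by category.
    w: strictly positive weights (assumption of this sketch).
    """
    blocks = []  # each block holds [weighted mean, total weight, item count]
    for value, weight in zip(y, w):
        blocks.append([value, weight, 1])
        # Pool neighbouring blocks while they violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, c1 + c2])
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted
```

Categories that end up in the same block receive the same quantification, which is how ties arise under the ordinal scaling level.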


Numerical: $y_r^* \leftarrow \mathrm{WLIN}(\tilde{y}_r)$.


The notation WLIN( ) is used to denote the weighted linear regression process. The weights
used are the diagonal elements of $D_r$.


Next $y_r^*$ is normalized (if the response variable was imputed with an extra category, $y_r^*$ is inclusive
category $k_r$ from here on):

$$y_r^+ = n_w^{1/2}\, y_r^* \left(y_r^{*\prime} D_r\, y_r^*\right)^{-1/2}$$


Update category quantifications and regression weights predictor variables 


For updating a predictor variable j, $j \in J_p$, first the contribution of variable j is removed from
v: $v_j = v - b_j G_j y_j$. Then the unconstrained update of $y_j$ is

$$\tilde{y}_j = D_j^{-1} G_j' W (G_r y_r - v_j)$$

Next $\tilde{y}_j$ is restricted and normalized as in step (2) to obtain $y_j^+$.


Finally, we update the regression coefficient

$$b_j^+ = n_w^{-1}\, \tilde{y}_j' D_j\, y_j^+$$

Regularized regression coefficients are obtained as


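$$\beta_j^{ridge} = \frac{\beta_j^+}{1 + \lambda_2} \quad \text{for Ridge,}$$

$$\beta_j^{lasso} = \left(\beta_j^+ - \frac{\lambda_1}{2} w_j\right)_+ = \beta_j^+ - \frac{\lambda_1}{2} \text{ if } \beta_j^+ > 0 \text{ and } \beta_j^+ + \frac{\lambda_1}{2} \text{ if } \beta_j^+ < 0 \quad \text{for Lasso, and}$$

$$\beta_j^{e\text{-}net} = \frac{\left(\beta_j^+ - \frac{\lambda_1}{2} w_j\right)_+}{1 + \lambda_2} = \frac{\beta_j^+ - \frac{\lambda_1}{2}}{1 + \lambda_2} \text{ if } \beta_j^+ > 0 \text{ and } \frac{\beta_j^+ + \frac{\lambda_1}{2}}{1 + \lambda_2} \text{ if } \beta_j^+ < 0 \quad \text{for Elastic Net}$$

(van der Kooij, 2007).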
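
The three penalties can be read as shrinkage operators applied to the updated coefficient. The sketch below rests on two assumptions: that the Lasso step is the usual soft-thresholding rule, so a coefficient whose magnitude does not exceed half the Lasso penalty is set to zero, and that the Ridge step divides by one plus the Ridge penalty. The names `lam1` and `lam2` are illustrative.

```python
def soft_threshold(beta, lam1):
    """Shrink beta toward zero by lam1 / 2 and truncate at zero."""
    magnitude = abs(beta) - lam1 / 2.0
    if magnitude <= 0.0:
        return 0.0
    return magnitude if beta > 0 else -magnitude


def regularize(beta, lam1=0.0, lam2=0.0):
    """Ridge (lam1 = 0), Lasso (lam2 = 0), or Elastic Net (both positive)."""
    return soft_threshold(beta, lam1) / (1.0 + lam2)
```

Setting both penalties to zero returns the coefficient unchanged, which corresponds to the unregularized update.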
Convergence test 


The difference between consecutive values of the Apparent Prediction Error (APE),

$$APE = n_w^{-1} \Big(G_r y_r - \sum_{j \in J_p} b_j G_j y_j\Big)' W \Big(G_r y_r - \sum_{j \in J_p} b_j G_j y_j\Big)$$

is compared with the user-specified convergence criterion $\varepsilon$, a small positive number.


Without regularization, APE is equal to 1 minus the squared multiple regression coefficient. Steps
(2) and (3) are repeated as long as the APE difference exceeds $\varepsilon$.


Diagnostics 


The procedure produces the following diagnostics. 


Descriptive Statistics 


The descriptives table gives the weighted univariate marginals and the weighted number of
missing values (system missing, user defined missing, and values less than or equal to 0) for 
each variable. 


Fit and error measures 


The squared multiple regression coefficient and the Apparent Prediction Error for each iteration 
are reported in the History table. Also, the decrease in APE for each iteration is reported. 


Summary Statistics 


The following model summary statistics are available. 


Multiple R 


$$R = (G_r y_r)' W v\, (n_w\, v' W v)^{-1/2}$$


Multiple R Square

$$R^2$$

Adjusted Multiple R Square


$$1 - (1 - R^2)(n_w - 1)(n_w - 1 - u'f)^{-1}$$


with u a p-vector of ones. 


Regularization “R Square” (1-Error) 


$$RSQ^{regu} = 1 - APE$$


Without regularization, $RSQ^{regu}$ is equal to $R^2$.


Apparent Prediction Error 


APE as computed in the convergence step in the last iteration of the optimization algorithm. For 
more information, see the topic “Objective Function Optimization”. 


Expected Prediction Error 


The expected prediction error is computed for the standardized (quantified) data. The EPE is
computed for the raw data as well only when the numeric scaling level is specified for all variables.


Supplementary objects (test cases) 


The expected prediction error for the training data (active cases) is

$$EPE^{train} = \frac{1}{n_w} \sum_{i=1}^{n} w_i \Big((G_r y_r)_i - \sum_{j \in J_p} b_j (G_j y_j)_i\Big)^2$$


and the standard error is

$$SE^{train} = \Big(\frac{1}{n_w^2} \sum_{i=1}^{n} w_i \big(epe_i^{train} - EPE^{train}\big)^2\Big)^{1/2}$$

For the test data (supplementary objects), the expected prediction error is

$$EPE^{test} = \frac{1}{n_{tot} - n} \sum_{i \in S} \Big((G_r y_r)_i - \sum_{j \in J_p} b_j (G_j y_j)_i\Big)^2$$

where S is the index set of supplementary objects. The standard error is

$$SE^{test} = \Big(\frac{1}{(n_{tot} - n)^2} \sum_{i \in S} \big(epe_i^{test} - EPE^{test}\big)^2\Big)^{1/2}$$


For the estimation of the quantification of a supplementary category (a category only occurring 
with supplementary cases), see the Quantification section below. 


Multiplying $EPE^{train}$, $SE^{train}$, $EPE^{test}$, and $SE^{test}$ with the variance of the response variable
for the active cases yields the EPE and SE for the raw data.


Resampling, .632 Bootstrap 


Bootstrap datasets are created by randomly drawing (with replacement) n times from the active 
objects (training data), including the object (case) weights. 


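$$EPE^{boot} = \widehat{Err}^{(.632)} = \overline{err} + OP$$

where the optimism is estimated as

$$OP = .632 \big(\widehat{Err}^{(1)} - \overline{err}\big)$$

and $\widehat{Err}^{(1)}$, the leave-one-out bootstrap estimate of prediction error, is

$$\widehat{Err}^{(1)} = \frac{1}{n_w^{(1)}} \sum_{i=1}^{n} w_i \frac{1}{|C^{-i}|} \sum_{b \in C^{-i}} \Big((G_r y_r)_i - \sum_{j \in J_p} b_j^{b} (G_j y_j^{b})_i\Big)^2$$
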
where $C^{-i}$ is the set of indices of the bootstrap samples b (b = 1, ..., B) that

(a) do not contain observation i,

(b) do contain the categories that apply to observation i for variables with nominal or ordinal
transformations,

(c) do not require extrapolation for observation i for variables with spline transformations.

$n_w^{(1)}$ is the number of observations for which $|C^{-i}| \neq 0$. (The set $C^{-i}$ may become empty
if, for example, observation i has one of the extreme categories on a variable with a spline
transformation, and this category has a frequency of one. Then each bootstrap sample that does
not contain this observation also does not contain the extreme category; thus for observation i
all bootstrap samples are excluded.)

The Standard Error is computed as


$$SE^{boot} = \Big(\frac{1}{\big(n_w^{(1)}\big)^2} \sum_{i=1}^{n} w_i \big(\widehat{Err}_i^{(1)} - \widehat{Err}^{(1)}\big)^2\Big)^{1/2}$$


Adding multiplication with the variance of the response variable for the cases in bootstrap sample
b in the computation of $\widehat{Err}^{(1)}$, that is, $(\ldots w_i\, \mathrm{var}(h_r^{b})(\ldots))$, yields the EPE and SE for the raw data.
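
The structure of the .632 estimate can be illustrated with a generic sketch. It is not the procedure's implementation: it assumes unit case weights, squared-error loss, and user-supplied `fit` and `predict` callables (both hypothetical), and it applies only condition (a) above, so the category and extrapolation checks of (b) and (c) are left out.

```python
import random


def bootstrap_632(x, y, fit, predict, n_boot=100, seed=0):
    """Return (err_bar, err1, epe_632) for squared-error loss.

    fit(xs, ys) -> model and predict(model, xi) -> value are hypothetical
    callables supplied by the caller. Unit case weights are assumed.
    """
    rng = random.Random(seed)
    n = len(y)
    model = fit(x, y)
    # Apparent error on the training data.
    err_bar = sum((y[i] - predict(model, x[i])) ** 2 for i in range(n)) / n
    per_case = [[] for _ in range(n)]
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # draw n times with replacement
        drawn = set(idx)
        model_b = fit([x[i] for i in idx], [y[i] for i in idx])
        for i in range(n):
            if i not in drawn:  # only samples that do not contain case i
                per_case[i].append((y[i] - predict(model_b, x[i])) ** 2)
    # Average over cases that were left out of at least one sample.
    used = [sum(errors) / len(errors) for errors in per_case if errors]
    err1 = sum(used) / len(used)
    return err_bar, err1, err_bar + 0.632 * (err1 - err_bar)
```

For example, `fit = lambda xs, ys: sum(ys) / len(ys)` with `predict = lambda m, xi: m` gives the .632 estimate for a mean-only model.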


Resampling, Cross-validation 


The data are randomly divided into K disjoint subsets of the active objects (training data), 
including the object (case) weights. 


$$EPE^{CV} = \frac{1}{n_w} \sum_{k=1}^{K} \sum_{i \in k} w_i \Big((G_r y_r)_i - \sum_{j \in J_p} b_j^{-k} (G_j y_j^{-k})_i\Big)^2$$

where k (k = 1, ..., K) indexes the kth subset and $-k$ the remaining part of the data.
The Standard Error is computed as 


$$SE^{CV} = \Big(\frac{1}{n_w^2} \sum_{i=1}^{n} w_i \big(epe_i^{CV} - EPE^{CV}\big)^2\Big)^{1/2}$$


Adding multiplication with the variance of the response variable for the cases with the kth part
removed in the computation of $EPE^{CV}$, that is, $(\ldots w_i\, \mathrm{var}(h_r^{-k})(\ldots))$, yields the EPE and SE for the
raw data.


Quantifications of categories that do not occur in a bootstrap sample or in the data with the kth
part removed are estimated as for supplementary categories (see “Quantifications”).


ANOVA Table 


Source        Sum of Squares       df                   Mean Sum of Squares

Regression    $n_w R^2$            $u'f$                $n_w R^2 (u'f)^{-1}$

Residual      $n_w (1 - R^2)$      $n_w - 1 - u'f$      $n_w (1 - R^2)(n_w - 1 - u'f)^{-1}$


$F = MS_{reg}/MS_{res}$
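
The table entries follow directly from the multiple R square, the weighted number of cases, and the total degrees of freedom of the predictors. A small sketch, with illustrative argument names:

```python
def anova_table(r2, n_w, df_reg):
    """Build the ANOVA entries from R^2, n_w and the regression df (u'f)."""
    df_res = n_w - 1 - df_reg
    ss_reg = n_w * r2
    ss_res = n_w * (1.0 - r2)
    ms_reg = ss_reg / df_reg
    ms_res = ss_res / df_res
    return {
        "regression": (ss_reg, df_reg, ms_reg),
        "residual": (ss_res, df_res, ms_res),
        "F": ms_reg / ms_res,
    }
```

Because the data are standardized, the two sums of squares always add up to the weighted number of cases.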
Correlations and Eigenvalues


Before transformation


$R = n_w^{-1} H_c' W H_c$, with $H_c$ weighted centered and normalized H excluding the response
variable.


After transformation


$R = n_w^{-1} Q' W Q$, the columns of Q are $q_j = G_j y_j$, $j \in J_p$.


Statistics for Predictor Variables 


The following statistics are produced for each predictor variable.


Beta


The standardized regression coefficient is $Beta_j = b_j$.


Standard Error Beta


The standard error of $Beta_j$ is estimated from 1000 bootstrap samples.


Degrees of Freedom

The degrees of freedom for a variable depend on the optimal scaling level:

Numerical. $f_j = 1$.

Spline ordinal, spline nominal. $f_j = s_j + t_j$ minus the number of elements equal to zero in $a_j$.

Ordinal, nominal. $f_j$ = the number of distinct values in $y_j$ minus 1.


F-Value


$$F_j = \big(Beta_j / SE(Beta_j)\big)^2$$


Zero-order correlation 


Correlations between the transformed response variable $G_r y_r$ and the transformed predictor
variables $G_j y_j$:

$$r_{rj} = n_w^{-1} (G_r y_r)' W G_j y_j$$


Partial correlation


$$PartialCorr_j = b_j \big((1/t_j)(1 - R^2) + b_j^2\big)^{-1/2}$$

with $t_j$ the tolerance for variable j (see “Tolerance”).


When a regularization method is applied, the OLS coefficients are computed as

$$\beta^{OLS} = (n_w R)^{-1} Q' W (G_r y_r)$$

with R the correlation matrix after transformation. $R^{-1}$ is computed using the eigenvectors
and eigenvalues of $R_p$, where $R_p$ is the correlation matrix of the predictors that have regression
coefficients > 0, and $R^2$ is computed as

$$R^2 = \Big((G_r y_r)' W Q \beta^{OLS} \big(n_w (\beta^{OLS})' Q' W Q \beta^{OLS}\big)^{-1/2}\Big)^2$$


Part correlation

$$PartCorr_j = b_j t_j^{1/2}$$

with $t_j$ the tolerance for variable j (see “Tolerance”).


For computation of the OLS coefficients if regularization is applied, see “Partial correlation”.


Importance

Pratt’s measure of relative importance (Pratt, 1987)

$$Imp_j = b_j r_{rj} / R^2$$


The relative importance is only displayed if no regularization is applied. 


Tolerance 


The tolerance for the optimally scaled predictor variables is given by

$$t_j = r_{p\,jj}^{-1}$$

with $r_{p\,jj}$ the jth diagonal element of $R_p^{-1}$, where $R_p$ is the correlation matrix of predictors that
have regression coefficients > 0.


The tolerance for the original predictor variables is also reported and is computed in the same 
way, using the correlation matrix for the original predictor variables, discretized, imputed, and 
listwise deleted, if applicable. 
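
The partial correlation, part correlation, and importance formulas can be checked together in a short sketch. It assumes an unregularized least squares fit on standardized variables, for which $R^2$ equals the sum of the products of each coefficient with its zero-order correlation; the function and argument names are illustrative.

```python
import math


def predictor_statistics(b, r, tol):
    """Partial correlation, part correlation and Pratt importance per predictor.

    b:   standardized regression coefficients
    r:   zero-order correlations with the transformed response
    tol: tolerances of the predictors
    Assumes an unregularized fit, so that R^2 = sum(b_j * r_j).
    """
    r2 = sum(bj * rj for bj, rj in zip(b, r))
    rows = []
    for bj, rj, tj in zip(b, r, tol):
        rows.append({
            "partial": bj / math.sqrt((1.0 - r2) / tj + bj * bj),
            "part": bj * math.sqrt(tj),
            "importance": bj * rj / r2,
        })
    return r2, rows
```

Under this assumption the importances sum to one, which is what makes them interpretable as relative contributions.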


Quantifications 


The quantifications are $y_j$, $j = 1, \ldots, m$.


Supplementary objects


The category indicators of supplementary objects are replaced by the quantification of the 
categories if these categories also appear in the active data. If a category is only used by 
supplementary objects, the category quantification is estimated by interpolation for variables 
with numeric or spline scaling level if the supplementary category lies within the range of the 
categories in the active data. If the variable has numeric scaling level and the non-occurring 
category lies outside the range of categories in the active data, then extrapolation is applied. In all 
other cases, the category indicator is replaced by a system-missing value. 


Predicted and residual values 
There is an option to save the predicted values v and the residual values $G_r y_r - v$.


Whether the predicted and residual values can be computed for a supplementary object depends on
whether all categories of the object are quantified (which is the case if all categories also appear
with the active objects) or can be estimated by interpolation or extrapolation (see “Quantifications”).


Residual Plots 


The residual plot for predictor variable j displays two sets of points: unnormalized quantifications
($b_j y_j$) against category indicators, and residuals when the response variable is predicted from all
predictor variables except variable j ($G_r y_r - (v - b_j G_j y_j)$) against category indicators.


Regularization 


If regularization is specified, all above diagnostics apply to the selected or specified regularized 
model. If more than one model is specified (more than one penalty value), diagnostics for each 
model can be requested. 


Statistics 


APE (see “Apparent Prediction Error”), EPE (see “Expected Prediction Error”), and the
standardized sum of coefficients for each model.


The standardized sum of coefficients is computed as


P 


TE tp for Ridge 
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P 
f sign(s;)s; 


for Lasso and Elastic Net 


J sign(s>)s- 
jEdp 


Coefficients 


The regularized standardized coefficients for each model. 


Paths 


The regularized standardized coefficients are plotted on the y-axis against the standardized sum 
of coefficients for each model on the x-axis. For the Elastic net, multiple plots are produced: a 
Lasso paths plot for each specified value of the ridge penalty. 
CCF Algorithms


CCF computes the cross-correlation functions of two or more time series. 


Notation 
The following notation is used throughout this chapter unless otherwise stated:

Table 14-1
Notation

Notation        Description
$X, Y$          Any two series of length n
$r_{xy}(k)$     Sample cross correlation coefficient at lag k
$S_x$           Standard deviation of series X
$S_y$           Standard deviation of series Y
$C_{xy}(k)$     Sample cross covariance at lag k


Cross Correlation 


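The cross correlation coefficient at lag k is estimated by

$$r_{xy}(k) = \frac{C_{xy}(k)}{S_x S_y}$$

where

$$C_{xy}(k) = \begin{cases} \dfrac{1}{n} \sum_{t=1}^{n-k} (x_t - \bar{x})(y_{t+k} - \bar{y}), & k = 0, 1, 2, \ldots \\[1ex] \dfrac{1}{n} \sum_{t=1}^{n+k} (y_t - \bar{y})(x_{t-k} - \bar{x}), & k = -1, -2, \ldots \end{cases}$$

$$S_x = \sqrt{\frac{1}{n} \sum_{t=1}^{n} (x_t - \bar{x})^2}, \qquad S_y = \sqrt{\frac{1}{n} \sum_{t=1}^{n} (y_t - \bar{y})^2}$$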

The cross correlation function is not symmetric about k = 0. 


The approximate standard error of $r_{xy}(k)$ is

$$\sqrt{\frac{1}{n - |k|}}$$

The standard error is based on the assumption that the series are not cross correlated and one of
the series is white noise. (The general formula for the standard error can be found in Box and
Jenkins (1976), p. 376, 11.1.7.)
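
A minimal sketch of the cross correlation at a single lag follows. It assumes two series of equal length without missing values, standard deviations computed with divisor n, and a standard error of the form one over the square root of n minus the absolute lag; the function names are illustrative.

```python
import math


def cross_correlation(x, y, k):
    """Sample cross correlation of x and y at lag k (k may be negative)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / n)
    s_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / n)
    if k >= 0:
        c = sum((x[t] - mean_x) * (y[t + k] - mean_y) for t in range(n - k)) / n
    else:
        c = sum((y[t] - mean_y) * (x[t - k] - mean_x) for t in range(n + k)) / n
    return c / (s_x * s_y)


def approximate_se(n, k):
    """Approximate standard error under the white-noise assumption."""
    return math.sqrt(1.0 / (n - abs(k)))
```

Evaluating the function at a lag and at its negative generally gives different values, which illustrates why the function is not symmetric about k = 0.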
CLUSTER produces hierarchical clusters of items based on distance measures of dissimilarity or
similarity.


Cluster Measures 


For more information, see the topic “Proximities Measures”. 


Clustering Methods 


The cluster method defines the rules for cluster formation. For example, when calculating the 
distance between two clusters, you can use the pair of nearest objects between clusters or the pair 
of furthest objects, or a compromise between these methods. 


Notation 


The following notation is used unless otherwise stated: 


Table 15-1
Notation

Notation    Description
S           Matrix of similarity or dissimilarity measures
$s_{ij}$    Similarity or dissimilarity measure between cluster i and cluster j
$N_i$       Number of cases in cluster i


General Procedure 


Begin with N clusters each containing one case. Denote the clusters 1 through N.


■ Find the most similar pair of clusters p and q (p > q). Denote this similarity $s_{pq}$. If a
dissimilarity measure is used, large values indicate dissimilarity. If a similarity measure
is used, small values indicate dissimilarity.


■ Reduce the number of clusters by one through merger of clusters p and q. Label the new
cluster t (= q) and update the similarity matrix (by the method specified) to reflect revised
similarities or dissimilarities between cluster t and all other clusters. Delete the row and
column of S corresponding to cluster p.


■ Perform the previous two steps until all entities are in one cluster.


For each of the following methods, the similarity or dissimilarity matrix S is updated to reflect
revised similarities or dissimilarities ($s_{tr}$) between the new cluster t and all other clusters r
as given below.


Average Linkage between Groups 


Before the first merge, let $N_i = 1$ for i = 1 to N. Update $s_{tr}$ by

$$s_{tr} = s_{pr} + s_{qr}$$

Update $N_t$ by

$$N_t = N_p + N_q$$

and then choose the most similar pair based on the value

$$s_{ij} / (N_i N_j)$$


Average Linkage within Groups 


Before the first merge, let $SUM_i = 0$ and $N_i = 1$ for i = 1 to N. Update $s_{tr}$ by

$$s_{tr} = s_{pr} + s_{qr}$$

Update $SUM_t$ and $N_t$ by

$$SUM_t = SUM_p + SUM_q + s_{pq}$$

$$N_t = N_p + N_q$$

and choose the most similar pair based on

$$\frac{SUM_i + SUM_j + s_{ij}}{\big((N_i + N_j)(N_i + N_j - 1)\big)/2}$$


Single Linkage 
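Update $s_{tr}$ by

$$s_{tr} = \begin{cases} \min(s_{pr}, s_{qr}) & \text{if S is a dissimilarity matrix} \\ \max(s_{pr}, s_{qr}) & \text{if S is a similarity matrix} \end{cases}$$


Complete Linkage


Update $s_{tr}$ by

$$s_{tr} = \begin{cases} \max(s_{pr}, s_{qr}) & \text{if S is a dissimilarity matrix} \\ \min(s_{pr}, s_{qr}) & \text{if S is a similarity matrix} \end{cases}$$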


Centroid Method 


Update $s_{tr}$ by

$$s_{tr} = \frac{N_p}{N_p + N_q} s_{pr} + \frac{N_q}{N_p + N_q} s_{qr} - \frac{N_p N_q}{(N_p + N_q)^2} s_{pq}$$


Median Method


Update $s_{tr}$ by

$$s_{tr} = (s_{pr} + s_{qr})/2 - s_{pq}/4$$


Ward’s Method 


Update $s_{tr}$ by

$$s_{tr} = \frac{1}{N_t + N_r} \big[(N_r + N_p) s_{rp} + (N_r + N_q) s_{rq} - N_r s_{pq}\big]$$

Update the coefficient W by

$$W = W + .5\, s_{pq}$$

Note that for Ward’s method, the coefficient given in the agglomeration schedule is really the
within-cluster sum of squares at that step. For all other methods, this coefficient represents the
distance at which the clusters p and q were joined.
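
The update rules above can be collected in one function. This is a sketch only: the method labels are hypothetical names chosen for the example, the cluster sizes are the sizes before the merge, and the two average linkage methods return the summed value that is later divided by the appropriate count when the most similar pair is chosen.

```python
def update_similarity(method, s_pr, s_qr, s_pq, n_p, n_q, n_r, dissimilarity=True):
    """Return s_tr for the new cluster t formed by merging clusters p and q."""
    if method in ("average_between", "average_within"):
        return s_pr + s_qr
    if method == "single":
        return min(s_pr, s_qr) if dissimilarity else max(s_pr, s_qr)
    if method == "complete":
        return max(s_pr, s_qr) if dissimilarity else min(s_pr, s_qr)
    n_t = n_p + n_q
    if method == "centroid":
        return (n_p * s_pr + n_q * s_qr) / n_t - n_p * n_q * s_pq / n_t ** 2
    if method == "median":
        return (s_pr + s_qr) / 2.0 - s_pq / 4.0
    if method == "ward":
        return ((n_r + n_p) * s_pr + (n_r + n_q) * s_qr - n_r * s_pq) / (n_t + n_r)
    raise ValueError("unknown method: " + method)
```

When clusters p and q each hold a single case, the centroid and median updates coincide, because the median method is the centroid method with the cluster sizes treated as equal.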
Cluster Evaluation Algorithms


This document describes measures used for evaluating clustering models.


■ The Silhouette coefficient combines the concepts of cluster cohesion (favoring models which
contain tightly cohesive clusters) and cluster separation (favoring models which contain
highly separated clusters). It can be used to evaluate individual objects, clusters, and models.


■ The sum of squares error (SSE) is a measure of prototype-based cohesion, while sum of
squares between (SSB) is a measure of prototype-based separation.


■ Predictor importance indicates how well the variable can differentiate different clusters. For
both range (numeric) and discrete variables, the higher the importance measure, the less
likely the variation for a variable between clusters is due to chance and more likely due to
some underlying difference.


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


$x_{ik}$      Continuous variable k in case i (standardized).

$x_{iks}$     The sth category of variable k in case i (one-of-c coding).

N             Total number of valid cases.

$N_j$         The number of cases in cluster j.

Y             Variable with J cluster labels.

$\mu_{jk}$    The centroid of cluster j for variable k.

$D_{ij}$      The distance between case i and the centroid of cluster j.

$D_j$         The distance between the overall mean $\mu$ and the centroid of cluster j.


Goodness Measures 


The average Silhouette coefficient is simply the average over all cases of the following calculation 
for each individual case: 


(B — A) /max (A, B) 


where A is the average distance from the case to every other case assigned to the same cluster and 
B is the minimal average distance from the case to cases of a different cluster across all clusters. 


Unfortunately, this coefficient is computationally expensive. In order to ease this burden, we use 
the following definitions of A and B: 


■ A is the distance from the case to the centroid of the cluster which the case belongs to;


■ B is the minimal distance from the case to the centroid of every other cluster.

Distances may be calculated using Euclidean distances. The Silhouette coefficient and its average
range between —1, indicating a very poor model, and 1, indicating an excellent model. As found 
by Kaufman and Rousseeuw (1990), an average silhouette greater than 0.5 indicates reasonable 
partitioning of data; less than 0.2 means that the data do not exhibit cluster structure. 


Data Preparation 


Before calculating the Silhouette coefficient, we need to transform cases as follows:


1. Recode categorical variables using one-of-c coding. If a variable has c categories, then it is 
stored as c vectors, with the first category denoted (1,0,...,0), the next category (0,1,0,...,0), ..., 
and the final category (0,0,...,0,1). The order of the categories is based on the ascending sort or 
lexical order of the data values. 


2. Rescale continuous variables. Continuous variables are normalized to the interval [—1, 1] using 
the transformation [2*(x—min)/(max—min)]—1. This normalization tries to equalize the 
contributions of continuous and categorical features to the distance computations. 
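
The two preparation steps can be sketched as follows. The function names are illustrative, and the handling of a constant variable (all values mapped to 0) is an assumption, since the text does not say what happens when the maximum equals the minimum.

```python
def one_of_c(values):
    """Recode a categorical variable into one-of-c indicator vectors.

    Categories are ordered by the ascending sort of the data values.
    """
    categories = sorted(set(values))
    position = {category: i for i, category in enumerate(categories)}
    return [[1 if position[v] == i else 0 for i in range(len(categories))]
            for v in values]


def rescale(values):
    """Map a continuous variable to [-1, 1] with 2 * (x - min) / (max - min) - 1."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0] * len(values)  # assumption: constant variables map to 0
    return [2.0 * (v - low) / (high - low) - 1.0 for v in values]
```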


Basic Statistics 


The following statistics are collected in order to compute the goodness measures: the centroid
$\mu_{jk}$ of variable k for cluster j, the distance between a case and the centroid, and the overall mean $\mu$.


For $\mu_{jk}$ with an ordinal or continuous variable k, we average all standardized values of variable
k within cluster j. For nominal variables, $\mu_{jk}$ is a vector $\{p_{jks}\}$ of probabilities of occurrence
for each state s of variable k for cluster j. Note that in counting $p_{jks}$, we do not consider cases with
missing values in variable k. If the value of variable k is missing for all cases within cluster j,
$\mu_{jk}$ is marked as missing.


The distance $D_{ij}$ between case i and the centroid of cluster j can be calculated in terms of the
weighted sum of the distance components $d_{ijk}^2$ across all variables; that is

$$D_{ij}^2 = \frac{\sum_k w_{ijk}\, d_{ijk}^2}{\sum_k w_{ijk}}$$

where $w_{ijk}$ denotes a weight. At this point, we do not consider differential weights, thus
$w_{ijk}$ equals 1 if the variable k in case i is valid, 0 if not. If all $w_{ijk}$ equal 0, set $D_{ij}^2 = 0$.


The distance component $d_{ijk}^2$ is calculated as follows for ordinal and continuous variables

$$d_{ijk}^2 = (x_{ik} - \mu_{jk})^2$$

For binary or nominal variables, it is

$$d_{ijk}^2 = \frac{1}{2} \sum_{s=1}^{S_k} (x_{iks} - p_{jks})^2$$

where variable k uses one-of-c coding, and $S_k$ is the number of its states.

The calculation of $D_j$ is the same as that of $D_{ij}$, but the overall mean $\mu$ is used in place of $\mu_{jk}$ and
$\mu_{jk}$ is used in place of $x_{ik}$.
Silhouette Coefficient 
The Silhouette coefficient of case i is

$$\frac{\min\{D_{ij}, j \in C_{-i}\} - D_{ic_i}}{\max\big(\min\{D_{ij}, j \in C_{-i}\}, D_{ic_i}\big)}$$

where $C_{-i}$ denotes cluster labels which do not include case i as a member, while $c_i$ is the cluster
label which includes case i. If $\max(\min\{D_{ij}, j \in C_{-i}\}, D_{ic_i})$ equals 0, the Silhouette of case i is
not used in the average operations.


Based on these individual data, the total average Silhouette coefficient is:

$$SC = \frac{1}{N} \sum_{i=1}^{N} \frac{\min\{D_{ij}, j \in C_{-i}\} - D_{ic_i}}{\max\big(\min\{D_{ij}, j \in C_{-i}\}, D_{ic_i}\big)}$$
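
A minimal sketch of the centroid-based Silhouette follows. It simplifies the procedure in several ways that should be read as assumptions: all variables are continuous and already rescaled, there are no missing values, the plain Euclidean distance replaces the variable-averaged distance defined earlier, and the average is taken over the cases that are actually used.

```python
import math


def centroid_silhouette(points, labels):
    """Average Silhouette using distances to cluster centroids."""
    clusters = sorted(set(labels))
    centroids = {}
    for c in clusters:
        members = [p for p, l in zip(points, labels) if l == c]
        centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    scores = []
    for p, l in zip(points, labels):
        a = math.dist(p, centroids[l])  # distance to own centroid
        others = [math.dist(p, centroids[c]) for c in clusters if c != l]
        if not others:
            continue                    # a single cluster gives no separation term
        b = min(others)                 # distance to the nearest other centroid
        denominator = max(a, b)
        if denominator == 0:
            continue                    # such cases are not used in the average
        scores.append((b - a) / denominator)
    return sum(scores) / len(scores) if scores else None
```

Only one distance per case and cluster is needed, which is what makes this version much cheaper than the pairwise definition.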
Sum of Squares Error (SSE) 


SSE is a prototype-based cohesion measure where the squared Euclidean distance is used. In order
to compare between models, we will use the averaged form, defined as:

$$\text{Average SSE} = \frac{1}{N} \sum_{j \in C} \sum_{i \in j} D_{ij}^2$$

Sum of Squares Between (SSB)


SSB is a prototype-based separation measure where the squared Euclidean distance is used. In
order to compare between models, we will use the averaged form, defined as:

$$\text{Average SSB} = \frac{1}{N} \sum_{j \in C} N_j D_j^2$$
Predictor Importance 
The importance of field i is defined as

$$\frac{-\log_{10}(sig_i)}{\max_{j \in \Omega}\big(-\log_{10}(sig_j)\big)}$$

where $\Omega$ denotes the set of predictor and evaluation fields, and $sig_i$ is the significance or
p-value computed from applying a certain test, as described below. If $sig_i$ equals zero, set
$sig_i$ = MinDouble, where MinDouble is the minimal double value.
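
The normalization can be sketched as follows. Two choices here are assumptions: the smallest positive normalized double stands in for MinDouble, and when every p-value equals 1 (so that the maximum is zero) all importances are returned as 0.

```python
import math
import sys


def predictor_importance(p_values):
    """Scale -log10(p) by its maximum over all fields.

    p_values: dict mapping field name to p-value.
    """
    min_double = sys.float_info.min  # assumption: stands in for MinDouble
    scores = {name: -math.log10(p if p > 0 else min_double)
              for name, p in p_values.items()}
    top = max(scores.values())
    if top == 0:
        return {name: 0.0 for name in scores}  # assumption: no field stands out
    return {name: score / top for name, score in scores.items()}
```

The field with the smallest p-value always receives importance 1, and the other fields are expressed relative to it.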


Across Clusters 
The p-value for categorical fields is based on Pearson’s chi-square. It is calculated by

p-value = Prob($\chi_d^2 > X^2$),

where

$$X^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(N_{ij} - \hat{N}_{ij})^2}{\hat{N}_{ij}}$$

where $\hat{N}_{ij} = N_{i\cdot} N_{\cdot j} / N$.

■ If N = 0, the importance is set to be undefined or unknown;

■ If $N_{i\cdot} = 0$, subtract one from I for each such category to obtain $I'$;

■ If $N_{\cdot j} = 0$, subtract one from J for each such cluster to obtain $J'$;

■ If $J' \le 1$ or $I' \le 1$, the importance is set to be undefined or unknown.

The degrees of freedom are $(I' - 1)(J' - 1)$.

The p-value for continuous fields is based on an F test. It is calculated by

p-value = Prob{$F(J' - 1, N - J') > F$},

where

$$F = \frac{\sum_{j=1}^{J} N_j (\bar{x}_j - \bar{x})^2 / (J' - 1)}{\sum_{j=1}^{J} (N_j - 1) s_j^2 / (N - J')}$$

■ If N = 0, the importance is set to be undefined or unknown;

■ If $N_j = 0$, subtract one from J for each such cluster to obtain $J'$;

■ If $J' \le 1$ or $N \le J'$, the importance is set to be undefined or unknown;

■ If the denominator in the formula for the F statistic is zero, the importance is set to be
undefined or unknown;

■ If the numerator in the formula for the F statistic is zero, set p-value = 1;

The degrees of freedom are $(J' - 1, N - J')$.


Within Clusters


The null hypothesis for categorical fields is that the proportion of cases in the categories in
cluster j is the same as the overall proportion.


The chi-square statistic for cluster j is computed as follows

$$\sum_{i=1}^{I} \frac{(N_{ij} - N_j p_i)^2}{N_j p_i}$$

If $N_j = 0$, the importance is set to be undefined or unknown;

If $p_i = 0$, subtract one from I for each such category to obtain $I'$;

If $I' \le 1$, the importance is set to be undefined or unknown.

The degrees of freedom are $d = I' - 1$.


The null hypothesis for continuous fields is that the mean in cluster j is the same as the overall 
mean. 


The Student’s t statistic for cluster j is computed as follows

$$t = \frac{\bar{x}_j - \bar{x}}{s_j / \sqrt{N_j}}$$

with $d = N_j - 1$ degrees of freedom.

If $N_j \le 1$ or $s_j = 0$, the importance is set to be undefined or unknown;

If the numerator is zero, set p-value = 1;

Here, the p-value based on Student’s t distribution is calculated as

p-value = 1 - Prob{$|T(d)| \le |t|$}.


CNLR Algorithms


Model 


Goal 


CNLR is used to estimate the parameters of a function by minimizing a smooth nonlinear loss 
function (objective function) when the parameters are subject to a set of constraints. 


Consider the model 


$$f = f(x, \theta)$$


where $\theta$ is a $p \times 1$ parameter vector, x is an independent variable vector, and f is a function
of x and $\theta$.


Find the estimate $\theta^*$ of $\theta$ such that $\theta^*$ minimizes

$$F = F(y, f)$$

subject to

$$l \le \begin{pmatrix} \theta \\ A_L \theta \\ C(\theta) \end{pmatrix} \le u$$

where F is the smooth loss function (objective function), which can be specified by the user.
$A_L$ is an $m_L \times p$ matrix of linear constraints, and $C(\theta)$ is an $m_N \times 1$ vector of nonlinear
constraint functions. $l' = (l_B', l_L', l_N')$, where $l_B$, $l_L$, and $l_N$ represent the lower bounds,
linear constraints and nonlinear constraints, respectively. The upper bound u is defined similarly.


Algorithm 


CNLR uses the algorithms proposed and implemented in NPSOL by Gill, Murray, Saunders, and 
Wright. A description of the algorithms can be found in the User’s Guide for NPSOL, Version 4.0 
(Gill, Murray, Saunders, and Wright, 1986). 

The method used in NPSOL is a sequential quadratic programming (SQP) method. For an 
overview of SQP methods, see (Gill, Murray, and Saunders, 1981), pp. 237-242. 

The basic structure of NPSOL involves major and minor iterations. Based on the given initial
value $\theta^{(0)}$ of $\theta$, the algorithm first selects an initial working set that includes bounds or general
inequality constraints that lie within a crash tolerance (CRSHTOL). At the kth iteration, the
algorithm starts with

Minor Iteration


This iteration searches for the direction $P_k$, which is the solution of a quadratic subproblem;
that is, $P_k$ is found by minimizing

$$g_k' P + \tfrac{1}{2} P' H_k P$$

subject to

$$\bar{l}^{(k)} \le \begin{pmatrix} P \\ A_L P \\ A_N P \end{pmatrix} \le \bar{u}^{(k)}$$

where $g_k$ is the gradient of F at $\theta^{(k)}$, the matrix $H_k$ is a positive-definite quasi-Newton
approximation to the Hessian of the Lagrangian function, $A_N$ is the Jacobian matrix of the
nonlinear-constraint vector C evaluated at $\theta^{(k)}$, and

$$\bar{l}^{(k)} = (\bar{l}_B, \bar{l}_L, \bar{l}_N), \qquad \bar{u}^{(k)} = (\bar{u}_B, \bar{u}_L, \bar{u}_N)$$

$$\bar{l}_B = l_B - \theta^{(k)}, \qquad \bar{l}_L = l_L - A_L \theta^{(k)}, \qquad \bar{l}_N = l_N - C(\theta^{(k)})$$

$$\bar{u}_B = u_B - \theta^{(k)}, \qquad \bar{u}_L = u_L - A_L \theta^{(k)}, \qquad \bar{u}_N = u_N - C(\theta^{(k)})$$

The linear feasibility tolerance, the nonlinear feasibility tolerance, and the feasibility tolerance are
used to decide if a solution is feasible for linear and nonlinear constraints.

Once the search direction $P_k$ is found, the algorithm goes to the major iteration.


Major Iteration 
The purpose of the major iteration is to find a non-negative scalar $\alpha_k$ such that

$$\theta^{(k+1)} = \theta^{(k)} + \alpha_k P_k$$

satisfies the following conditions:


■ $\theta^{(k+1)}$ produces a “sufficient decrease” in the augmented Lagrangian merit function

$$L(\theta, \lambda, s) = F(\theta) - \sum_i \lambda_i \big(c_i(\theta) - s_i\big) + \frac{1}{2} \sum_i \rho_i \big(c_i(\theta) - s_i\big)^2$$

The summation terms involve only the nonlinear constraints. The vector $\lambda$ is an estimate of
the Lagrange multipliers for the nonlinear constraints. The non-negative slack variables
$\{s_i\}$ allow nonlinear inequality constraints to be treated without introducing discontinuities.
The solution of the QP subproblem defined in “Minor Iteration” provides a vector triple that
serves as a direction search for $\theta$, $\lambda$, and s. The non-negative vector of penalty parameters
($\rho$) is initialized to zero at the beginning of the first major iteration. Function precision criteria
are used as a measure of the accuracy with which the functions F and $c_i$ can be evaluated.


■ $\theta^{(k+1)}$ is close to a minimum of F along $P_k$. The criterion is

$$\big|g(\theta^{(k+1)})' P_k\big| \le -\eta\, g_k' P_k$$

where $\eta$ is the Line Search Tolerance and $0 \le \eta < 1$. The value of $\eta$ determines the accuracy
with which $\alpha_k$ approximates a stationary point of F along $P_k$. A smaller value of $\eta$ produces a
more accurate line search.


■ The step length is in a certain range; that is,

$$\|\theta^{(k+1)} - \theta^{(k)}\| = \|\alpha_k P_k\| \le \text{Step Limit}$$

Convergence Tests


After $\alpha_k$ is determined from the major iteration, the following conditions are checked:


■ k+1 < Maximum number of major iterations


■ The sequence $\{\theta^{(k)}\}$ converged at $\theta^{(k+1)}$; that is,


$$\|\alpha_k p_k\| \le \sqrt{r}\,\left(1 + \|\theta^{(k)}\|\right)$$

■ $\theta^{(k+1)}$ satisfies the Kuhn-Tucker conditions to the accuracy requested; that is,

$$\|g_z(\theta^{(k+1)})\| \le \sqrt{r}\,\left(1 + \max\left(1 + |F(\theta^{(k+1)})|,\ \|g(\theta^{(k+1)})\|\right)\right)$$

and

$$|res_j| \le FTOL \text{ for all } j,$$


where $g_z$ is the projected gradient, g is the gradient of F with respect to the free parameters,
$res_j$ is the violation of the jth nonlinear constraint, FTOL is the Nonlinear Feasibility
Tolerance, and r is the Optimality Tolerance.


If none of these three conditions are satisfied, the algorithm continues with the Minor Iteration to 
find a new search direction. 
Termination


The following are termination conditions. 


Underflow. A single underflow will always occur if machine constants are computed 
automatically. Other floating-point underflows may occur occasionally, but can usually 
be ignored. 


Overflow. If the printed output before the overflow error contains a warning about serious
ill-conditioning in the working set when adding the jth constraint, it may be possible to avoid
the difficulty by increasing the magnitude of FTOL, LFTOL, or NFTOL and rerunning the
program. If the message recurs after this change, the offending linearly dependent constraints
(with index “j”) must be removed from the problem.


Optimal solution found. 


Optimal solution found, but the requested accuracy could not be achieved. NPSOL terminates
because no further improvement can be made in the merit function. This is probably caused
by requesting a more accurate solution than is attainable with the given precision of the
problem (as specified by FPRECISION).


No point has been found that satisfies the linear constraints. NPSOL terminates without 
finding a feasible point for the given value of LFTOL. The user should check that there are no 
constraint redundancies and ensure that the value of LFTOL is greater than the precision of 
parameter estimates. 


No point has been found which satisfies the nonlinear constraints. There is no feasible point 
found in QP subproblems. The user should check the validity of constraints. If the user is 
convinced that a feasible point does exist, NPSOL should be restarted at a different starting 
point. 


Too many iterations. If the algorithm appears to be making progress, increase the value of 
ITER and rerun NPSOL. If the algorithm seems to be “bogged down”, the user should check 
for incorrect gradients. 


Cannot improve on current point. A sufficient decrease in the merit function could not be 
attained during the final line search. This sometimes occurs because an overly stringent 
accuracy has been requested; for example, Optimality Tolerance is too small or a too-small 
step limit is given when the parameters are measured on different scales. 


Please note the following: 


Unlike the other procedures, the weight function is not treated as a case replicate in CNLR. 


When both weight and loss function are specified, the algorithm takes the product of these
two functions as the loss function.


If the loss function is not specified, the default loss function is a squared loss function and 
the default output in NLR will be printed. However, if the loss function is not a squared loss 
function, CNLR prints only the final parameter estimates, iteration history, and termination 
message. In order to obtain estimates of the standard errors of parameter estimates and 
correlations between parameter estimates, the bootstrapping method can be requested. 
Bootstrapping Estimates


Bootstrapping is a nonparametric technique of estimating the standard error of a parameter 
estimate using repeated samples from the original data. This is done by sampling with 
replacement. CNLR computes and saves the parameter estimates for each sample generated. 
This results, for each parameter, in a sample of estimates from which the standard deviation is 
calculated. This is the estimated standard error. 

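Mathematically, the bootstrap covariance matrix S for the p parameter estimates is

$$S = (S_{ij})$$

where

$$S_{ij} = \sum_{k=1}^{m} \left(\hat\theta_{(k)i} - \bar\theta_i\right)\left(\hat\theta_{(k)j} - \bar\theta_j\right), \qquad \bar\theta_j = \frac{1}{m}\sum_{k=1}^{m} \hat\theta_{(k)j}$$

and $\hat\theta_{(k)j}$ is the CNLR parameter estimate of $\theta_j$ for the kth bootstrap sample and m is the number
of samples generated by the bootstrap. By default, $m = 10p(p+1)/2$. The standard error for the jth
parameter estimate is estimated by


$$\sqrt{S_{jj}/(m-1)}$$


and the correlation between the ith and jth parameter estimates is estimated by


$$S_{ij}\big/\sqrt{S_{ii}S_{jj}}$$
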
The “95% Trimmed Range” values are the most extreme values that remain after trimming,
from the set of estimates for a parameter, the g largest and the g smallest estimates, where g is the
largest integer not exceeding 0.025m.
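
The bootstrap summaries described above can be sketched as follows. This is an illustrative sketch only, assuming the bootstrap estimates are already available as a list of m rows of p values and that S holds sums of cross-products of deviations from the bootstrap means; the function names are hypothetical and do not correspond to anything in CNLR.

```python
import math


def bootstrap_summary(estimates):
    """estimates: list of m rows, each a list of p parameter estimates.

    Returns (standard errors, correlation matrix) computed from the
    cross-product matrix S of deviations from the bootstrap means.
    """
    m, p = len(estimates), len(estimates[0])
    means = [sum(row[j] for row in estimates) / m for j in range(p)]
    S = [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in estimates)
          for j in range(p)] for i in range(p)]
    se = [math.sqrt(S[j][j] / (m - 1)) for j in range(p)]
    corr = [[S[i][j] / math.sqrt(S[i][i] * S[j][j]) for j in range(p)]
            for i in range(p)]
    return se, corr


def trimmed_range(values):
    """95% trimmed range: drop the g largest and g smallest, g = floor(0.025 m)."""
    m = len(values)
    g = (25 * m) // 1000  # integer arithmetic avoids floating-point rounding
    v = sorted(values)
    return v[g], v[m - 1 - g]


se, corr = bootstrap_summary([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
low, high = trimmed_range(list(range(1, 101)))
```

With 100 bootstrap estimates, g = 2, so the two smallest and two largest values are discarded before the range is reported.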
CONJOINT Algorithms


This procedure performs conjoint analysis using ordinary least squares. 


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


n          The total number of regular cards in the PLAN file.
p          The total number of factors.
d          The number of discrete factors.
l          The number of linear factors.
q          The number of quadratic factors.
$m_i$      The number of levels of the ith discrete factor.
$a_{ij}$   The jth level of the ith discrete factor, i=1,...,d.
$x_i$      The ith linear factor, i=1,...,l.
$z_i$      The ith ideal or anti-ideal factor, i=1,...,q.
$r_i$      The response for the ith card, i=1,...,n.
t          The total number of subjects being analyzed at the same time. (When
           /SUBJECT is specified, t is usually 1.)


Model 


The model for the response $r_i$ for the ith card from a subject is

$$r_i = \beta_0 + \sum_{j=1}^{p} u_{jk_{ji}}$$


where $u_{jk_{ji}}$ is the utility (part worth) associated with the $k_{ji}$th level of the jth factor on the ith card.


Design Matrix 


A design matrix X is formed from the values in the PLAN file. There is one row for each card 
in the PLAN file. The columns of the matrix are defined by each of the factor variables in the 
following manner: 


■ There is a column of 1s for the constant. This column is used for the estimate of $\beta_0$.


■ For each discrete factor containing $m_i$ levels, $m_i - 1$ columns are formed. Each column
represents the deviation of one of the factor levels from the overall mean. There is a 1 in the
column if that level of the factor was observed, a -1 if the last level of the factor was observed,
or a 0 otherwise. These columns are used to estimate the $m_i - 1$ values of $a_{ij}$.

■ For each linear factor, there is one column which is the centered value of that factor $(x_{ij} - \bar{x}_i)$.
These columns are used to estimate the values for $\beta_i$.


■ For each quadratic factor there are two columns, one which contains the centered value of the
factor $(z_{ij} - \bar{z}_i)$, the next which contains the square of the centered factor value $(z_{ij} - \bar{z}_i)^2$.
These columns are used to estimate the values of $\gamma_i$.
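
The column layout described above can be illustrated with a small sketch that builds one row of X for a single card. The function name `design_row` and its argument layout are assumptions made for this example (discrete levels are coded 1..m, and the factor means are supplied by the caller); it is not the CONJOINT implementation.

```python
def design_row(discrete, linear, quadratic, n_levels, lin_means, quad_means):
    """Build one design-matrix row for a card.

    discrete:   observed level (1-based) of each discrete factor
    linear:     observed value of each linear factor
    quadratic:  observed value of each ideal/anti-ideal factor
    n_levels:   number of levels m_i of each discrete factor
    lin_means, quad_means: factor means used for centering
    """
    row = [1.0]  # constant column
    for level, m in zip(discrete, n_levels):
        if level == m:
            # last level: -1 in every column of this factor
            row.extend([-1.0] * (m - 1))
        else:
            row.extend([1.0 if level == k else 0.0 for k in range(1, m)])
    for x, xbar in zip(linear, lin_means):
        row.append(x - xbar)  # centered linear factor
    for z, zbar in zip(quadratic, quad_means):
        row.extend([z - zbar, (z - zbar) ** 2])  # centered value and its square
    return row


last_level = design_row([3], [2.0], [4.0], [3], [1.5], [3.0])
first_level = design_row([1], [1.0], [2.0], [3], [1.5], [3.0])
```

Coding the last level as -1 in every column of its factor is what makes the estimated discrete effects sum to zero within each factor.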


Converting Card Numbers to Ranks 


If the observations are card numbers, they are converted to ranks. If card number i has a value of
k, then $r_i = k$.


Estimation 


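The estimates

$$\left(\hat\beta_0^*, \hat{a}, \hat\beta, \hat\gamma^*\right)' = (X'X)^{-1}X'Y$$

are computed by using a QR decomposition (see MANOVA) where


$$y_i = \begin{cases} r_i & \text{if responses are scores} \\ n + 1 - r_i & \text{if responses are ranks} \end{cases}$$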


The variance-covariance matrix of these estimates is


$$\hat\sigma^2 (X'X)^{-1}$$


where


$$\hat\sigma^2 = \sum_{i=1}^{n}\sum_{j=1}^{t} \left(r_{ij} - \hat{r}_i\right)^2 \big/ (nt - d - l - 2q - 1)$$


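with variances


$$\mathrm{var}(\hat\gamma_{j1}) = \mathrm{var}(\hat\gamma^*_{j1}) - 4\bar{z}_j\,\mathrm{cov}(\hat\gamma^*_{j1}, \hat\gamma^*_{j2}) + 4\bar{z}_j^2\,\mathrm{var}(\hat\gamma^*_{j2})$$


and


$$\mathrm{var}(\hat\gamma_{j2}) = \mathrm{var}(\hat\gamma^*_{j2})$$


where


$$\mathrm{cov}(\hat\gamma_{j1}, \hat\gamma_{j2}) = \mathrm{cov}(\hat\gamma^*_{j1}, \hat\gamma^*_{j2}) - 2\bar{z}_j\,\mathrm{var}(\hat\gamma^*_{j2})$$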
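
The variance expressions for a quadratic factor follow from undoing the centering. The short derivation below is a sketch under the assumption that starred quantities are the coefficients of the centered columns $(z - \bar{z})$ and $(z - \bar{z})^2$ and unstarred quantities are the coefficients of $z$ and $z^2$; the factor index j is dropped for readability.

```latex
\begin{align*}
\gamma_1^*(z-\bar z) + \gamma_2^*(z-\bar z)^2
  &= (\gamma_1^* - 2\bar z\,\gamma_2^*)\,z + \gamma_2^*\,z^2
     + (\gamma_2^*\bar z^2 - \gamma_1^*\bar z) \\
\Rightarrow\quad \gamma_1 &= \gamma_1^* - 2\bar z\,\gamma_2^*, \qquad \gamma_2 = \gamma_2^* \\
\operatorname{var}(\hat\gamma_1)
  &= \operatorname{var}(\hat\gamma_1^*) - 4\bar z\operatorname{cov}(\hat\gamma_1^*,\hat\gamma_2^*)
     + 4\bar z^2\operatorname{var}(\hat\gamma_2^*) \\
\operatorname{cov}(\hat\gamma_1,\hat\gamma_2)
  &= \operatorname{cov}(\hat\gamma_1^*,\hat\gamma_2^*) - 2\bar z\operatorname{var}(\hat\gamma_2^*)
\end{align*}
```

The constant term $\gamma_2^*\bar z^2 - \gamma_1^*\bar z$ equals $-(\gamma_1\bar z + \gamma_2\bar z^2)$, which is the quadratic-factor contribution that is absorbed into the intercept.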
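The value for $\hat\beta_0$ is calculated by

$$\hat\beta_0 = \hat\beta_0^* - \sum_{i=1}^{l} \hat\beta_i \bar{x}_i - \sum_{j=1}^{q} \left(\hat\gamma_{j1}\bar{z}_j + \hat\gamma_{j2}\bar{z}_j^2\right)$$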

with variance 

var (40) = ad, ta 


where 


and 
var Bo cov( 35, 51) cov( 3p, 2" ) cov( 3} 
Ya = var y7 Ae 
varyay - 
va ra 
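Utility (Part Worth) Values

Discrete Factors

$$\hat{u}_{jk} = \begin{cases} \hat{a}_{jk} & \text{for } k = 1,\ldots,m_j - 1 \\ -\sum_{i=1}^{m_j - 1} \hat{a}_{ji} & \text{for } k = m_j \end{cases}$$

Linear Factors


$$\hat{u}_{jk} = \hat\beta_j x_{jk}$$


Ideal or Anti-ideal Factors


$$\hat{u}_{jk} = \hat\gamma_{j1} z_{jk} + \hat\gamma_{j2} z_{jk}^2$$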
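Standard Errors of Part Worths


The standard error of part worth $\hat{u}_{jk}$ is $\sqrt{\mathrm{var}(\hat{u}_{jk})}$, where $\mathrm{var}(\hat{u}_{jk})$ is defined below:


Discrete Factors


$$\mathrm{var}(\hat{u}_{jk}) = \begin{cases} \mathrm{var}(\hat{a}_{jk}) & \text{for } k = 1,\ldots,m_j - 1 \\ \sum_{i=1}^{m_j - 1} \mathrm{var}(\hat{a}_{ji}) + 2\sum_{i=1}^{m_j - 1}\sum_{l<i} \mathrm{cov}(\hat{a}_{ji}, \hat{a}_{jl}) & \text{for } k = m_j \end{cases}$$

Linear Factors


$$\mathrm{var}(\hat{u}_{jk}) = x_{jk}^2\,\mathrm{var}(\hat\beta_j)$$


Ideal or Anti-ideal Factors


$$\mathrm{var}(\hat{u}_{jk}) = z_{jk}^2\,\mathrm{var}(\hat\gamma_{j1}) + 2z_{jk}^3\,\mathrm{cov}(\hat\gamma_{j1}, \hat\gamma_{j2}) + z_{jk}^4\,\mathrm{var}(\hat\gamma_{j2})$$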


Importance Scores

The importance score for factor i is


$$IMP_i = 100\,\frac{RANGE_i}{\sum_{i=1}^{p} RANGE_i}$$


where $RANGE_i$ is the highest minus lowest utility for factor i. If there is a SUBJECT command,
the importance for each factor is calculated separately for each subject, and these are then
averaged.
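
A small sketch shows how part worths for a discrete factor and importance scores fit together. The function names are hypothetical, and the sketch assumes importance is reported as a percentage of the summed ranges (the factor of 100 is an assumption of this example).

```python
def discrete_part_worths(a_hat):
    """a_hat: estimates for the first m-1 levels of a discrete factor.

    The last level's part worth is minus the sum of the others, so the
    part worths of the factor sum to zero.
    """
    return list(a_hat) + [-sum(a_hat)]


def importance(utilities_by_factor):
    """Importance of each factor: its utility range as a share of all ranges."""
    ranges = [max(u) - min(u) for u in utilities_by_factor]
    total = sum(ranges)
    return [100.0 * r / total for r in ranges]


brand = discrete_part_worths([1.0, -0.25])   # three-level discrete factor
price = [0.0, 0.5]                           # utilities of a two-level factor
scores = importance([brand, price])
```

When a SUBJECT variable is used, a function like `importance` would be applied per subject and the resulting scores averaged, as the text describes.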


Predicted Scores

$$\hat{r}_i = \hat\beta_0 + \sum_{j=1}^{p} \hat{u}_{jk_{ji}}$$


where $\hat{u}_{jk_{ji}}$ is the estimated utility (part worth) associated with the $k_{ji}$th level of the jth factor.

Correlations

Pearson and Kendall correlations are calculated between predicted ($\hat{r}_i$) and the observed
($r_i$) responses. See the CORRELATIONS and NONPAR CORR chapters for algorithms. Pearson
correlations for holdouts are not calculated.


Simulations 


Each person is assigned a probability $p_i$ for each simulation i. The probabilities are all computed
based on the predicted score ($\hat{r}_i$) for that product. The probabilities are computed as follows:


Max Utility


$$p_i = \begin{cases} 1 & \text{if } \hat{r}_i = \max_i(\hat{r}_i) \\ 0 & \text{otherwise} \end{cases}$$


BTL


$$p_i = \frac{\hat{r}_i}{\sum_j \hat{r}_j}$$


Logit


$$p_i = \frac{e^{\hat{r}_i}}{\sum_j e^{\hat{r}_j}}$$


Probabilities are averaged across respondents for the grouped simulation results. For the BTL and
Logit methods, only subjects having all positive $\hat{r}_i$ values are used.
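
The three choice rules can be compared side by side for one respondent. This sketch assumes the usual definitions of the Bradley-Terry-Luce rule (score divided by the sum of scores) and the logit rule (exponentiated score divided by the sum of exponentiated scores); the function names are illustrative only.

```python
import math


def max_utility(scores):
    """Probability 1 for the product(s) with the highest predicted score."""
    top = max(scores)
    return [1.0 if s == top else 0.0 for s in scores]


def btl(scores):
    """Bradley-Terry-Luce: each score as a share of the total (scores must be positive)."""
    total = sum(scores)
    return [s / total for s in scores]


def logit(scores):
    """Logit: exponentiated score as a share of the total."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


p_max = max_utility([2.0, 5.0, 3.0])
p_btl = btl([1.0, 3.0])
p_logit = logit([0.0, 0.0])
```

The BTL shares only behave like probabilities when every score is positive, which is consistent with restricting the BTL and Logit summaries to subjects whose predicted scores are all positive.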
The user-specified treatment for missing values is used for computation of all statistics except,
under certain conditions, the means and standard deviations. 


Notation 

The following notation is used throughout this section unless otherwise specified: 

Table 19-1
Notation

Notation    Description
N           Number of cases
$X_{lk}$    Value of the variable k for case l
$w_l$       Weight for case l
$W_k$       Sum of weights of cases used in computation of statistics for variable k
$W_{kj}$    Sum of weights of cases used in computation of statistics for variables k and j

Statistics 


The following statistics are available. 


Means and Standard Deviations 


$$\bar{X}_k = \frac{\sum_{l=1}^{N} w_l X_{lk}}{W_k}$$

$$S_k = \sqrt{\left(\sum_{l=1}^{N} w_l X_{lk}^2 - \bar{X}_k^2 W_k\right) \Big/ (W_k - 1)}$$

Note: If no treatment for missing values is specified (default is pairwise), means and standard 
deviations are computed based on all nonmissing values for each variable. If missing values are to 
be included or listwise is chosen, that option is used for means and standard deviations as well. 


Cross-product Deviations and Covariances 


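The cross-product deviation for variables i and j is


$$C_{ij} = \sum_{l=1}^{N} w_l X_{li} X_{lj} - \left(\sum_{l=1}^{N} w_l X_{li}\right)\left(\sum_{l=1}^{N} w_l X_{lj}\right) \Big/ W_{ij}$$


The covariance is

$$S_{ij} = \frac{C_{ij}}{W_{ij} - 1}$$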
Pearson Correlation

$$r_{ij} = \frac{C_{ij}}{\sqrt{C_{ii} C_{jj}}}$$

Significance Level of r


The significance level for $r_{ij}$ is based on

$$t = r_{ij}\sqrt{\frac{W_{ij} - 2}{1 - r_{ij}^2}}$$

which, under the null hypothesis, is distributed as a t with $W_{ij} - 2$ degrees of freedom. By default,
the significance level is two-tailed.
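
The weighted cross-product, covariance, correlation, and test statistic can be computed in one pass. The sketch below assumes that both variables are nonmissing for every case (so the pairwise sum of weights is simply the total weight) and that the test statistic is the usual t transformation of r; the function name is hypothetical.

```python
import math


def weighted_pearson(x, y, w):
    """Return (r, t, covariance) for two variables with case weights w."""
    W = sum(w)  # sum of weights of the cases used for this pair
    sx = sum(wi * xi for wi, xi in zip(w, x))
    sy = sum(wi * yi for wi, yi in zip(w, y))
    # cross-product deviations
    cxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y)) - sx * sy / W
    cxx = sum(wi * xi * xi for wi, xi in zip(w, x)) - sx * sx / W
    cyy = sum(wi * yi * yi for wi, yi in zip(w, y)) - sy * sy / W
    r = cxy / math.sqrt(cxx * cyy)
    t = r * math.sqrt((W - 2) / (1 - r * r))  # t with W - 2 degrees of freedom
    return r, t, cxy / (W - 1)


r, t, cov = weighted_pearson([1, 2, 3, 4], [2, 1, 4, 3], [1, 1, 1, 1])
```

The t statistic is undefined when r is exactly 1 or -1, so a production routine would need to guard that case.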
The CORRESPONDENCE algorithm consists of three major parts:
1. A singular value decomposition (SVD).
2. Centering and rescaling of the data and various rescalings of the results.
3. Variance estimation by the delta method.


Other names for SVD are “Eckart-Young decomposition” after Eckart and Young (1936), who 
introduced the technique in psychometrics, and “basic structure” (Horst, 1963). The rescalings 
and centering, including their rationale, are well explained in Benzécri (1969), Nishisato (1980), 
Gifi (1981), and Greenacre (1984). Those who are interested in the general framework of matrix 
approximation and reduction of dimensionality with positive definite row and column metrics 
are referred to Rao (1980). The delta method is a method that can be used for the derivation 

of asymptotic distributions and is particularly useful for the approximation of the variance of 
complex statistics. There are many versions of the delta method, differing in the assumptions 
made and in the strength of the approximation (Rao, 1973, ch. 6; Bishop et al., 1975, ch. 14; 
Wolter, 1985, ch. 6). 

Other characteristic features of CORRESPONDENCE are the ability to fit supplementary 
points into the space defined by the active points, the ability to constrain rows and/or columns to 
have equal scores, and the ability to make biplots using either chi-squared distances, as in standard 
correspondence analysis, or Euclidean distances. 


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


$t_1$       Total number of rows (row objects)
$s_1$       Number of supplementary rows
$k_1$       Number of rows in analysis ($t_1 - s_1$)
$t_2$       Total number of columns (column objects)
$s_2$       Number of supplementary columns
$k_2$       Number of columns in analysis ($t_2 - s_2$)
p           Number of dimensions
$f_{ij}$    Nonnegative data value for row i and column j; collected in table F
$f_{i+}$    Marginal total of row i, $i = 1, \ldots, k_1$
$f_{+j}$    Marginal total of column j, $j = 1, \ldots, k_2$
N           Grand total of F


Scores and statistics:
$r_{is}$    Score of row object i on dimension s
$c_{js}$    Score of column object j on dimension s
I           Total inertia


Basic Calculations 


One way to phrase the CORRESPONDENCE objective (cf. Heiser, 1981) is to say that we wish
to find row scores $\{r_{is}\}$ and column scores $\{c_{js}\}$ so that the function


$$\sigma\left(\{r_{is}\}; \{c_{js}\}\right) = \sum_i \sum_j \sum_s f_{ij}\left(r_{is} - c_{js}\right)^2$$

is minimal, under the standardization restriction either that

$$\sum_i f_{i+} r_{is} r_{it} = \delta^{st}$$

or

$$\sum_j f_{+j} c_{js} c_{jt} = \delta^{st}$$

where $\delta^{st}$ is Kronecker’s delta and t is an alternative index for dimensions. The trivial set of
scores ({1},{1}) is excluded.


The CORRESPONDENCE algorithm can be subdivided into five steps, as explained below. 


Data scaling and centering 


When rows and/or columns are specified to be equal, first the frequencies of the rows/columns 
to be equal are summed. The sums are put into the row/column with the smallest row/column 
number and the other rows/columns are set to zero. 


Measure is Chi Square 


The first step is to form the auxiliary matrix Z with general element

$$z_{ij} = \frac{f_{ij}}{\sqrt{f_{i+} f_{+j}}} - \frac{\sqrt{f_{i+} f_{+j}}}{N}$$

The standardization with Chi Square measure is always rcmean (both row and column means
removed).
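
As an illustration of the chi-square scaling, the sketch below forms an auxiliary matrix for a small table and checks a known property: its sum of squared elements equals the Pearson chi-square statistic divided by N (the total inertia). The element formula used here, $f_{ij}/\sqrt{f_{i+}f_{+j}} - \sqrt{f_{i+}f_{+j}}/N$, is the standard correspondence-analysis form and is an assumption of this example, as are the function name and the toy table.

```python
import math


def chi_square_z(F):
    """Auxiliary matrix Z for the chi-square measure (no supplementary points)."""
    row = [sum(r) for r in F]            # f_i+
    col = [sum(c) for c in zip(*F)]      # f_+j
    N = sum(row)                         # grand total
    return [[F[i][j] / math.sqrt(row[i] * col[j])
             - math.sqrt(row[i] * col[j]) / N
             for j in range(len(col))] for i in range(len(row))]


F = [[10, 20], [30, 40]]                 # hypothetical 2 x 2 frequency table
Z = chi_square_z(F)
total_inertia = sum(z * z for zrow in Z for z in zrow)

# Pearson chi-square computed directly from observed and expected counts
N = 100
expected = [[30 * 40 / N, 30 * 60 / N], [70 * 40 / N, 70 * 60 / N]]
chi2 = sum((F[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(2) for j in range(2))
```

Because the singular values of Z are obtained from this matrix, their squares sum to the same total inertia.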


Measure is Euclidean 


When Euclidean measure is chosen, the auxiliary matrix Z is formed in two steps: 


Jit+Js4 


» 25 = Si - No 
With $\tilde{f}_{ij}$, $\tilde{f}_{i+}$, and $\tilde{f}_{+j}$ depending on the standardization option.

rmean (remove row means). f= fi;: f= fix: £2; a 

cmean (remove column means). f= f,;: {> E: Ff = fe 

rcmean (remove both row and column means).f> = fi;: ff. = fiz: £f5 = fe; 

rsum (equalize row totals, then remove row means). fF = fife fr _ Nf = = 


fiz ? 24 key 


csum (equalize column totals, then remove column means). f;; = a ip = £: ft; = £ 


► Then, if not computed yet in the previous step, $\tilde{f}_{i+}$ and/or $\tilde{f}_{+j}$ are computed:


~ N ~ N ‘ j 
fz ki? fF; ka? and “ij Ps kn 


Singular value decomposition 


When rows and/or columns are specified as supplementary, first these rows and/or columns of Z
are set to zero, yielding Z.


Let the singular value decomposition of Z be denoted by

$$Z = K \Lambda L'$$


with $K'K = I$, $L'L = I$, and $\Lambda$ diagonal. This decomposition is calculated by a routine based
on Golub and Reinsch (1971). It involves Householder reduction to bidiagonal form and
diagonalization by a QR procedure with shifts. The routine requires an array with more rows
than columns, so when $k_1 < k_2$ the original table is transposed and the parameter transfer is
permuted accordingly.


Adjustment to the row and column metric 


The arrays of both the left-hand singular vectors and the right-hand singular vectors are adjusted
row-wise to form scores that are standardized in the row and in the column marginal proportions,
respectively:

$$r_{is} = k_{is} \Big/ \sqrt{f_{i+}/N},$$

$$c_{js} = l_{js} \Big/ \sqrt{f_{+j}/N}.$$


This way, both sets of scores satisfy the standardization restrictions simultaneously. 


Determination of variances and covariances 


For the application of the delta method to the results of generalized eigenvalue methods under
multinomial sampling, the reader is referred to Gifi (1990, ch. 12) and Israëls (1987, Appendix
B). It is shown there that N times the variance-covariance matrix of a function $\phi$ of the observed cell
proportions $p = \{p_{ij} = f_{ij}/N\}$ asymptotically reaches the form

$$N \times \mathrm{cov}(\phi(p)) = \sum_i \sum_j \pi_{ij}\left(\frac{\partial\phi}{\partial p_{ij}}\right)\left(\frac{\partial\phi}{\partial p_{ij}}\right)' - \left(\sum_i \sum_j \pi_{ij}\frac{\partial\phi}{\partial p_{ij}}\right)\left(\sum_i \sum_j \pi_{ij}\frac{\partial\phi}{\partial p_{ij}}\right)'$$


Here the quantities $\pi_{ij}$ are the cell probabilities of the multinomial distribution, and $\partial\phi/\partial p_{ij}$ are
the partial derivatives of $\phi$ (which is either a generalized eigenvalue or a generalized eigenvector)
with respect to the observed cell proportion. Expressions for these partial derivatives can also
be found in the above-mentioned references.


Normalization of row and column scores 


Depending on the normalization option chosen, the scores are normalized. The normalization 
option q can be chosen to be any value in the interval [-1,1] or it can be specified according to 
the following designations: 


$$q = \begin{cases} 0, & \text{symmetrical} \\ 1, & \text{row principal} \\ -1, & \text{column principal} \end{cases}$$


There is a fifth possibility, choosing the designation “principal”, that does not correspond to 
a q-value. 


When “principal” is chosen, normalization parameters $\alpha$ for the rows and $\beta$ for the columns are
both set to 1. When one of the other options is chosen, $\alpha$ and $\beta$ are functions of q:


$$\alpha = (1+q)/2$$

$$\beta = (1-q)/2$$


The normalization implies a compensatory rescaling of the coordinate axes of the row scores
and the column scores:


$$\tilde{r}_{is} = r_{is}\lambda_s^{\alpha}$$

$$\tilde{c}_{js} = c_{js}\lambda_s^{\beta}$$


The general formula for the weighted sum of squares that results from this rescaling is

row scores: $\sum_i f_{i+}\tilde{r}_{is}^2 = N\lambda_s^{2\alpha}$

column scores: $\sum_j f_{+j}\tilde{c}_{js}^2 = N\lambda_s^{2\beta}$


The estimated variances and covariances are adjusted according to the type of normalization
chosen.
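
The effect of the normalization option on the scores can be sketched as below. The function name `normalize_scores` and the convention of passing `q=None` for the “principal” designation are assumptions made for this example.

```python
def normalize_scores(r, c, lam, q=None):
    """Rescale standardized scores by powers of the singular values.

    r[i][s], c[j][s]: standardized row and column scores
    lam[s]:           singular value of dimension s
    q:                value in [-1, 1]; None stands for 'principal'
    """
    if q is None:
        alpha = beta = 1.0
    else:
        alpha, beta = (1 + q) / 2, (1 - q) / 2
    r_t = [[ris * lam[s] ** alpha for s, ris in enumerate(row)] for row in r]
    c_t = [[cjs * lam[s] ** beta for s, cjs in enumerate(row)] for row in c]
    return r_t, c_t


rows, cols, lam = [[1.0, 2.0]], [[3.0, 4.0]], [0.25, 0.04]
sym_r, sym_c = normalize_scores(rows, cols, lam, q=0)    # symmetrical
rp_r, rp_c = normalize_scores(rows, cols, lam, q=1)      # row principal
pr_r, pr_c = normalize_scores(rows, cols, lam)           # principal
```

Under row principal normalization the column scores are left in their standardized form, while under “principal” both sets are multiplied by the full singular value.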


Diagnostics 


After printing the data, CORRESPONDENCE optionally also prints a table of row profiles and
column profiles, which are $\{f_{ij}/f_{i+}\}$ and $\{f_{ij}/f_{+j}\}$, respectively.
Singular Values, Maximum Rank and Inertia


All singular values $\lambda_s$ defined in the second step are printed up to a maximum of
$\min\{(k_1 - 1), (k_2 - 1)\}$. Small singular values and corresponding dimensions are suppressed
when they don’t exceed the quantity $(k_1 k_2)^{1/2} 10^{-7}$; in this case a warning message is issued.
Dimensionwise inertia and total inertia are given by the relationships


‘ tpeeay 
ry = » Ds < 


iS i 


where the right-hand part of this equality is true only if the normalization is row principal (but
for the other normalizations similar relationships are easily derived from step 5). The quantities
“proportion explained” are equal to inertia divided by total inertia: $\lambda_s^2/I$.


Supplementary Points 


Supplementary row and column points are given by 


s aq Ia— 
way. ee 
is / ok Ss 
Pgh 
J 


Mass, Scores, Inertia and Contributions 


The mass, scores, inertia and contributions for the row and column points (including
supplementary points) are given in the Overview Row Points Table and the Overview Column
Points Table. These tables are printed in p dimensions. The tables are given first for rows, then
for columns. The masses are the marginal proportions ($f_{i+}/N$ and $f_{+j}/N$, respectively). The
inertia of the rows/columns is given by:


For supplementary points, the contribution to the inertia of dimensions is zero. The contribution 
of the active points to the inertia of each dimension is given by 


2 
Vis 


N26 


Tig. = 


ce 
js 


N 
z fei 
Tis = W 


The contribution of dimensions to the inertia of each point is given by 
ee
Cis = N 


fa Coa 
%js= NO 


Confidence Statistics of Singular Values and Scores 


The computation of variances and covariances is explained in step 4. Since the row and column
scores are linear functions of the singular vectors, an adjustment is necessary depending on the
normalization option chosen. From these adjusted variances and covariances, standard deviations
and correlations are derived in the standard way.


Permutations of the Input Table

For each dimension s, let $\rho(i|s)$ be the permutation of the first $t_1$ integers that would sort the
sth column of $\{\tilde{r}_{is}\}$ in ascending order. Similarly, let $\rho(j|s)$ be the permutation of the first $t_2$
integers that would sort the sth column of $\{\tilde{c}_{js}\}$ in ascending order. Then the permuted data
matrix is given by $\{f_{\rho(i|s),\rho(j|s)}\}$.
Cox (1972) first suggested the models in which factors related to lifetime have a multiplicative
effect on the hazard function. These models are called proportional hazards models. Under the
proportional hazards assumption, the hazard function h of t given x is of the form


$$h(t|x) = h_0(t)\,e^{x'\beta}$$


where x is a known vector of regressor variables associated with the individual, $\beta$ is a vector of
unknown parameters, and $h_0(t)$ is the baseline hazard function for an individual with x = 0.
Hence, for any two covariate sets $x_1$ and $x_2$, the log hazard functions $h(t|x_1)$ and $h(t|x_2)$ should
be parallel across time.


When a factor does not affect the hazard function multiplicatively, stratification may be useful in
model building. Suppose that individuals can be assigned to one of m different strata, defined
by the levels of one or more factors. The hazard function for an individual in the jth stratum is
defined as


$$h_j(t|x) = h_{0j}(t)\,e^{x'\beta}$$


There are two unknown components in the model: the regression parameter $\beta$ and the baseline
hazard function $h_{0j}(t)$. The estimation for the parameters is described below.


Estimation 


We begin by considering a nonnegative random variable T representing the lifetimes of individuals
in some population. Let f(t|x) denote the probability density function (pdf) of T given a regressor
x and let S(t|x) be the survivor function (the probability of an individual surviving until time
t). Hence


$$S(t|x) = \int_t^{\infty} f(u|x)\,du$$


The hazard h(t|x) is then defined by


$$h(t|x) = \frac{f(t|x)}{S(t|x)}$$

Another useful expression for S(t|x) in terms of h(t|x) is

$$S(t|x) = \exp\left(-\int_0^t h(u|x)\,du\right)$$

Thus,


$$\ln S(t|x) = -\int_0^t h(u|x)\,du$$


For some purposes, it is also useful to define the cumulative hazard function


$$H(t|x) = \int_0^t h(u|x)\,du = -\ln S(t|x)$$
Under the proportional hazard assumption, the survivor function can be written as

$$S(t|x) = \left[S_0(t)\right]^{\exp(x'\beta)}$$

where $S_0(t)$ is the baseline survivor function defined by


$$S_0(t) = \exp(-H_0(t))$$


and


$$H_0(t) = \int_0^t h_0(u)\,du$$


Some relationships between S(t|x), H(t|x) and $H_0(t)$, $S_0(t)$ and $h_0(t)$ which will be used later are

$$\ln S(t|x) = -H(t|x) = -\exp(x'\beta)\,H_0(t)$$


$$\ln(-\ln S(t|x)) = x'\beta + \ln H_0(t)$$
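
These relationships are easy to verify numerically. The sketch below uses a hypothetical baseline with constant hazard 0.5 (so the cumulative baseline hazard is 0.5t) and a single covariate; the function name `survivor` and all numeric values are assumptions chosen only for illustration.

```python
import math


def survivor(t, x, beta, H0):
    """S(t|x) = exp(-exp(x'beta) * H0(t)) under proportional hazards."""
    eta = sum(xi * bi for xi, bi in zip(x, beta))
    return math.exp(-math.exp(eta) * H0(t))


H0 = lambda t: 0.5 * t            # hypothetical cumulative baseline hazard
x, beta = [1.0], [math.log(2.0)]  # one covariate with hazard ratio 2
S = survivor(1.0, x, beta, H0)    # survivor function for this covariate value
S0 = math.exp(-H0(1.0))           # baseline survivor function at t = 1
loglog = math.log(-math.log(S))   # should equal x'beta + ln H0(t)
```

Here the hazard ratio is 2, so S(t|x) is the square of the baseline survivor function, and the log-minus-log transform is linear in the covariates.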


To estimate the survivor function S(t|x), we can see from the equation for the survivor function
that there are two components, $\beta$ and $S_0(t)$, which need to be estimated. The approach we use
here is to estimate $\beta$ from the partial likelihood function and then to maximize the full likelihood
for $S_0(t)$.


Estimation of Beta 


Assume that

■ There are m levels for the stratification variable.

■ Individuals in the same stratum have proportional hazard functions.

■ The relative effect of the regressor variables is the same in each stratum.


Let $t_{j1} < \cdots < t_{jk_j}$ be the observed uncensored failure time of the $k_j$ individuals in the jth
stratum and $x_{j1}, \ldots, x_{jk_j}$ be the corresponding covariates. Then the partial likelihood function is
defined by


$$L(\beta) = \prod_{j=1}^{m}\prod_{i=1}^{k_j} \frac{e^{S_{ji}'\beta}}{\left(\sum_{l \in R_{ji}} w_l e^{x_l'\beta}\right)^{d_{ji}}}$$


where $d_{ji}$ is the sum of case weights of individuals whose lifetime is equal to $t_{ji}$ and $S_{ji}$ is
the weighted sum of the regression vector x for those $d_{ji}$ individuals, $w_l$ is the case weight of
individual l, and $R_{ji}$ is the set of individuals alive and uncensored just prior to $t_{ji}$ in the jth
stratum. Thus the log-likelihood arising from the partial likelihood function is


$$l = \ln L(\beta) = \sum_{j=1}^{m}\sum_{i=1}^{k_j} S_{ji}'\beta - \sum_{j=1}^{m}\sum_{i=1}^{k_j} d_{ji}\ln\left(\sum_{l \in R_{ji}} w_l e^{x_l'\beta}\right)$$
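and the first derivatives of l are


$$\frac{\partial l}{\partial \beta_r} = \sum_{j=1}^{m}\sum_{i=1}^{k_j}\left(S_{ji}^{(r)} - d_{ji}\,\frac{\sum_{l \in R_{ji}} w_l x_{lr} e^{x_l'\beta}}{\sum_{l \in R_{ji}} w_l e^{x_l'\beta}}\right), \quad r = 1, \ldots, p$$


$S_{ji}^{(r)}$ is the rth component of $S_{ji} = \left(S_{ji}^{(1)}, \ldots, S_{ji}^{(p)}\right)'$. The maximum partial likelihood estimate
(MPLE) of $\beta$ is obtained by setting $\partial l/\partial \beta_r$ equal to zero for r = 1,..., p, where p is the number of
independent variables in the model. The equations $\partial l/\partial \beta_r = 0$ (r = 1,...,p) can usually be
solved by using the Newton-Raphson method.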

Note from its equation that the partial likelihood function $l(\beta)$ is invariant under
translation. All the covariates are centered by their corresponding overall mean. The overall mean
of a covariate is defined as the sum of the product of weight and covariate for all the censored and
uncensored cases in each stratum. For notational simplicity, $x_l$ used in the Estimation Section
denotes centered covariates.
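
The log partial likelihood and its first derivatives can be evaluated directly from their definitions. The sketch below handles a single stratum and treats the risk set at a failure time as every case whose observed time is at least that time; the function name `partial_loglik`, the data layout, and the toy data are assumptions of this example, and no Newton-Raphson iteration is included.

```python
import math


def partial_loglik(times, events, X, w, beta):
    """Log partial likelihood and gradient for one stratum.

    times:  observed times; events: 1 if failure, 0 if censored
    X:      list of covariate vectors; w: case weights
    """
    p = len(beta)
    eta = [sum(xr * br for xr, br in zip(x, beta)) for x in X]
    ll, grad = 0.0, [0.0] * p
    for t in sorted({ti for ti, e in zip(times, events) if e}):
        dead = [l for l in range(len(times)) if times[l] == t and events[l]]
        risk = [l for l in range(len(times)) if times[l] >= t]
        d = sum(w[l] for l in dead)                       # d_ji
        denom = sum(w[l] * math.exp(eta[l]) for l in risk)
        ll += sum(w[l] * eta[l] for l in dead) - d * math.log(denom)
        for r in range(p):
            num = sum(w[l] * X[l][r] * math.exp(eta[l]) for l in risk)
            grad[r] += sum(w[l] * X[l][r] for l in dead) - d * num / denom
    return ll, grad


# Hypothetical data: failures at t=1 and t=2, one censored case at t=3.
ll0, grad0 = partial_loglik([1, 2, 3], [1, 1, 0],
                            [[1.0], [0.0], [1.0]], [1.0, 1.0, 1.0], [0.0])
```

At $\beta = 0$ the log-likelihood reduces to minus the sum of $d_{ji}\ln(n_{ji})$ over failure times, where $n_{ji}$ is the weight of the risk set, which gives $-\ln 3 - \ln 2$ for this toy data.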


Three convergence criteria for the Newton-Raphson method are available:


■ Absolute value of the largest difference in parameter estimates between iterations ($\delta$) divided
by the value of the parameter estimate for the previous iteration; that is,


$$BCON = \frac{\delta}{\text{parameter estimate for previous iteration}}$$


■ Absolute difference of the log-likelihood function between iterations divided by the
log-likelihood function for previous iteration.


■ Maximum number of iterations.


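The asymptotic covariance matrix for the MPLE $\hat\beta = (\hat\beta_1, \ldots, \hat\beta_p)'$ is estimated by $I^{-1}$, where I is
the information matrix containing minus the second partial derivatives of $\ln L$. The (r, s)-th
element of I is defined by


$$I_{rs} = -E\,\frac{\partial^2 \ln L}{\partial\beta_r\,\partial\beta_s} = \sum_{j=1}^{m}\sum_{i=1}^{k_j} d_{ji}\left[\frac{\sum_{l \in R_{ji}} w_l x_{lr} x_{ls} e^{x_l'\beta}}{\sum_{l \in R_{ji}} w_l e^{x_l'\beta}} - \frac{\left(\sum_{l \in R_{ji}} w_l x_{lr} e^{x_l'\beta}\right)\left(\sum_{l \in R_{ji}} w_l x_{ls} e^{x_l'\beta}\right)}{\left(\sum_{l \in R_{ji}} w_l e^{x_l'\beta}\right)^2}\right]$$


We can also write I in a matrix form as


$$I = \sum_{j=1}^{m}\sum_{i=1}^{k_j} d_{ji}\, x'(t_{ji})\, V(t_{ji})\, x(t_{ji})$$


where $x(t_{ji})$ is an $n_{ji} \times p$ matrix which represents the p covariate variables in the model evaluated
at time $t_{ji}$, $n_{ji}$ is the number of distinct individuals in $R_{ji}$, and $V(t_{ji})$ is an $n_{ji} \times n_{ji}$ matrix with
the lth diagonal element $v_{ll}(t_{ji})$ defined by

$$v_{ll}(t_{ji}) = p_l(t_{ji})\,w_l - \left(w_l\,p_l(t_{ji})\right)^2$$

$$p_l(t_{ji}) = \frac{\exp(x_l'\beta)}{\sum_{h \in R_{ji}} w_h \exp(x_h'\beta)}$$


and the (l, k) element $v_{lk}(t_{ji})$ defined by


$$v_{lk}(t_{ji}) = -\,w_l\,p_l(t_{ji}) \times w_k\,p_k(t_{ji})$$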


Estimation of the Baseline Function 


After the MPLE $\hat\beta$ of $\beta$ is found, the baseline survivor function $S_{0j}(t)$ is estimated separately for
each stratum. Assume that, for a stratum, $t_1 < \cdots < t_k$ are observed lifetimes in the sample.
There are $n_i$ at risk and $d_i$ deaths at $t_i$, and in the interval $[t_{i-1}, t_i)$ there are $\lambda_i$ censored times.
Since $S_0(t)$ is a survivor function, it is non-increasing and left continuous, and thus $\hat{S}_0(t)$ must be
constant except for jumps at the observed lifetimes $t_1, \ldots, t_k$.

Further, it follows that


$$\hat{S}_0(t_1) = 1$$

and

$$\hat{S}_0(t_i+) = \hat{S}_0(t_{i+1})$$

Writing $\hat{S}_0(t_i+) = p_i$ (i = 1,...,k), the observed likelihood function is of the form


$$L_1 = \prod_{i=1}^{k}\left\{\prod_{l \in D_i}\left(p_{i-1}^{\exp(x_l'\beta)} - p_i^{\exp(x_l'\beta)}\right)^{w_l}\prod_{l \in C_i} p_{i-1}^{w_l\exp(x_l'\beta)}\right\}\prod_{l \in C_{k+1}} p_k^{w_l\exp(x_l'\beta)}$$


where $D_i$ is the set of individuals dying at $t_i$ and $C_i$ is the set of individuals with censored times in
$[t_{i-1}, t_i)$. (Note that if the last observation is uncensored, $C_{k+1}$ is empty and $p_k = 0$.)


If we let $\alpha_i = p_i/p_{i-1}$ (i = 1,...,k), $L_1$ can be written as


$$L_1 = \prod_{i=1}^{k}\prod_{l \in D_i}\left(1 - \alpha_i^{\exp(x_l'\beta)}\right)^{w_l}\prod_{l \in R_i - D_i}\alpha_i^{w_l\exp(x_l'\beta)}$$


Differentiating $\ln L_1$ with respect to $\alpha_1, \ldots, \alpha_k$ and setting the equations equal to zero, we get


$$\sum_{l \in D_i}\frac{w_l\exp(x_l'\beta)}{1 - \alpha_i^{\exp(x_l'\beta)}} = \sum_{l \in R_i} w_l\exp(x_l'\beta), \quad i = 1, \ldots, k$$

We then plug the MPLE $\hat\beta$ of $\beta$ into this equation and solve these k equations separately.


There are two things worth noting:


■ If any $|D_i| = 1$, $\hat\alpha_i$ can be solved explicitly.

$$\hat\alpha_i = \left[1 - \frac{w_i\exp(x_i'\hat\beta)}{\sum_{l \in R_i} w_l\exp(x_l'\hat\beta)}\right]^{\exp(-x_i'\hat\beta)}$$

■ If $|D_i| > 1$, the equation for the cumulative hazard function must be solved iteratively for
$\hat\alpha_i$. A good initial value for $\hat\alpha_i$ is


$$\hat\alpha_i = \exp\left[\frac{-d_i}{\sum_{l \in R_i} w_l\exp(x_l'\hat\beta)}\right]$$

where $d_i = \sum_{l \in D_i} w_l$ is the weight sum for set $D_i$. (See Lawless, 1982, p. 361.)

Once the $\hat\alpha_i$, i = 1,...,k are found, $S_0(t)$ is estimated by


$$\hat{S}_0(t) = \prod_{i:(t_i < t)}\hat\alpha_i$$


Since the above estimate of $S_0(t)$ requires some iterative calculations when ties exist, Breslow
(1974) suggests using the equation for $\hat\alpha_i$ when $|D_i| > 1$ as an estimate; however, we will use
this as an initial estimate.
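
The closed-form pieces of the baseline estimate can be sketched as follows. The function names are hypothetical; `risk_sum` stands for the weighted sum of $\exp(x_l'\hat\beta)$ over the risk set, and the iterative solution needed for tied failure times is deliberately left out.

```python
import math


def alpha_single_death(w_i, eta_i, risk_sum):
    """Explicit solution when exactly one individual fails at t_i.

    eta_i is x_i' beta_hat for the failing individual.
    """
    return (1.0 - w_i * math.exp(eta_i) / risk_sum) ** math.exp(-eta_i)


def alpha_initial(d_i, risk_sum):
    """Starting value used when several individuals fail at t_i."""
    return math.exp(-d_i / risk_sum)


def baseline_survivor(t, death_times, alphas):
    """Product of the alpha_i over failure times strictly before t."""
    s = 1.0
    for ti, a in zip(death_times, alphas):
        if ti < t:
            s *= a
    return s


a1 = alpha_single_death(1.0, 0.0, 4.0)      # one of four equally weighted cases fails
a0 = alpha_initial(2.0, 4.0)                # two tied failures among four cases
s = baseline_survivor(2.5, [1.0, 2.0, 3.0], [0.75, 0.5, 0.9])
```

With no covariate effect and unit weights, the single-death formula reduces to the familiar product-limit factor, here 1 - 1/4.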

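The asymptotic variance for $-\ln\hat{S}_0(t)$ can be found in Chapter 4 of Kalbfleisch and Prentice
(1980). At a specified time t, it is consistently estimated by


$$\mathrm{var}\left(-\ln\hat{S}_0(t)\right) = \sum_{t_i < t} d_i\left(\sum_{l \in R_i} w_l\exp(x_l'\hat\beta)\right)^{-2} + a' I^{-1} a$$


where a is a p×1 vector with the jth element defined by


$$\sum_{t_i < t} d_i\,\frac{\sum_{l \in R_i} w_l x_{lj}\exp(x_l'\hat\beta)}{\left(\sum_{l \in R_i} w_l\exp(x_l'\hat\beta)\right)^2}$$


and I is the information matrix. The asymptotic variance of $\hat{S}(t|x)$ is estimated by


$$e^{2x'\hat\beta}\left(\hat{S}(t|x)\right)^2\mathrm{var}\left(-\ln\hat{S}_0(t)\right)$$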


Selection Statistics for Stepwise Methods 


The same methods for variable selection are offered as in binary logistic regression. For more 
information, see the topic “Stepwise Variable Selection”. Here we will only define the three 
removal statistics—Wald, LR, and Conditional—and the Score entry statistic. 
Score Statistic


The score statistic is calculated for every variable not in the model to decide which variable should
be added to the model. First we compute the information matrix I for all eligible variables based
on the parameter estimates for the variables in the model and zero parameter estimates for the
variables not in the model. Then we partition the resulting I into four submatrices as follows:


$$\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}$$


where $A_{11}$ and $A_{22}$ are square matrices for variables in the model and variables not in the model,
respectively, and $A_{12}$ is the cross-product matrix for variables in and out. The score statistic
for variable $x_i$ is defined by


$$D_{x_i}'\,B_{22,i}\,D_{x_i}$$


where $D_{x_i}$ is the first derivative of the log-likelihood with respect to all the parameters associated
with $x_i$ and $B_{22,i}$ is equal to $\left(A_{22,i} - A_{21,i}A_{11}^{-1}A_{12,i}\right)^{-1}$, and $A_{22,i}$ and $A_{12,i}$ are the submatrices
in $A_{22}$ and $A_{12}$ associated with variable $x_i$.


Wald Statistic 


The Wald statistic is calculated for the variables in the model to select variables for removal.
The Wald statistic for variable $x_j$ is defined by


$$\hat\beta_j'\,B_{11,j}^{-1}\,\hat\beta_j$$


where $\hat\beta_j$ is the parameter estimate associated with $x_j$ and $B_{11,j}$ is the submatrix of $A_{11}^{-1}$ associated
with $x_j$.


LR (Likelihood Ratio) Statistic 


The LR statistic is defined as twice the log of the ratio of the likelihood functions of two models
evaluated at their own MPLEs. Assume that r variables are in the current model and let us call the
current model the full model. Based on the MPLEs of parameters for the full model, l(full) is
defined in “Estimation of Beta”. For each of r variables deleted from the full model, MPLEs
are found and the reduced log-likelihood function, l(reduced), is calculated. Then the LR statistic is
defined as


$$-2\left(l(\text{reduced}) - l(\text{full})\right)$$


Conditional Statistic 


The conditional statistic is also computed for every variable in the model. The formula for the
conditional statistic is the same as for the LR statistic except that the parameter estimates for each
reduced model are conditional estimates, not MPLEs. The conditional estimates are defined as
follows. Let $\hat\beta = (\hat\beta_1, \ldots, \hat\beta_r)'$ be the MPLEs for the r variables (blocks) and C be the asymptotic
covariance matrix of $\hat\beta$. The conditional estimate for the parameters left in the model given $\hat\beta_i$ is

$$\tilde\beta_{(i)} = \hat\beta_{(i)} - C_{12}^{(i)}\left(C_{22}^{(i)}\right)^{-1}\hat\beta_i$$


where $\hat\beta_i$ is the MPLE for the parameter(s) associated with $x_i$ and $\hat\beta_{(i)}$ is $\hat\beta$ without $\hat\beta_i$, $C_{12}^{(i)}$ is
the covariance between the parameter estimates left in the model $\hat\beta_{(i)}$ and $\hat\beta_i$, and $C_{22}^{(i)}$ is the
covariance of $\hat\beta_i$. Then the conditional statistic for variable $x_i$ is defined by

$$-2\left(l(\tilde\beta_{(i)}) - l(\text{full})\right)$$

where $l(\tilde\beta_{(i)})$ is the log-likelihood function evaluated at $\tilde\beta_{(i)}$.


Note that all four of these statistics have a chi-square distribution with degrees of freedom equal to
the number of parameters the corresponding model has.


Statistics 


The following output statistics are available. 


Initial Model Information 


The initial model for the first method is a model that does not include covariates. The log-likelihood function $l$ is equal to

$$l(0) = -\sum_{j=1}^{m} \sum_{i=1}^{k_j} d_{ji} \ln(n_{ji})$$

where $n_{ji}$ is the sum of weights of individuals in set $R_{ji}$.


Model Information


When a stepwise method is requested, at each step, the -2 log-likelihood function and three chi-square statistics (model chi-square, improvement chi-square, and overall chi-square) and their corresponding degrees of freedom and significance are printed.


-2 Log-Likelihood 


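$$-2 \sum_{j=1}^{m} \sum_{i=1}^{k_j} \left[ s_{ji}'\hat\beta - d_{ji} \ln\left( \sum_{l \in R_{ji}} w_l \exp(x_l'\hat\beta) \right) \right]$$

where $\hat\beta$ is the MPLE of $\beta$ for the current model.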


Improvement Chi-Square


(-2 log-likelihood function for previous model) - (-2 log-likelihood function for current model).

The previous model is the model from the last step. The degrees of freedom are equal to the absolute value of the difference between the number of parameters estimated in these two models.


Model Chi-Square


(-2 log-likelihood function for initial model) - (-2 log-likelihood function for current model).

The initial model is the final model from the previous method. The degrees of freedom are equal to the absolute value of the difference between the number of parameters estimated in these two models.

Note: The values of the model chi-square and improvement chi-square can be less than or equal to zero. If the degrees of freedom are equal to zero, the chi-square is not printed.


Overall Chi-Square 


The overall chi-square statistic tests the hypothesis that all regression coefficients for the variables 
in the model are identically zero. This statistic is defined as 


$$u'(0)\, I^{-1}\, u(0)$$

where $u(0)$ represents the vector of first derivatives of the partial log-likelihood function evaluated at $\beta = 0$. The elements of $u$ and $I$ are defined in “Estimation of Beta”.


Information for Variables in the Equation 


For each of the single variables in the equation, the MPLE, SE for the MPLE, Wald statistic, and its corresponding df, significance, and partial R are given. For a single variable, R is defined by

$$R = \left[\frac{\text{Wald} - 2}{-2\ \text{log-likelihood for the initial model}}\right]^{1/2} \times \text{sign of MPLE}$$

if Wald > 2. Otherwise R is set to zero. For a multiple category variable, only the Wald statistic, df, significance, and partial R are printed, where R is defined by

$$R = \left[\frac{\text{Wald} - 2\,df}{-2\ \text{log-likelihood for the initial model}}\right]^{1/2}$$

if Wald > 2df. Otherwise R is set to zero.


Information for the Variables Not in the Equation


For each of the variables not in the equation, the Score statistic is calculated and its corresponding degrees of freedom, significance, and partial R are printed. The partial R for variables not in the equation is defined similarly to the R for the variables in the equation by changing the Wald statistic to the Score statistic.

There is one overall statistic called the residual chi-square. This statistic tests if all regression coefficients for the variables not in the equation are zero. It is defined by

$$u'(\hat\beta)\, B_{22}\, u(\hat\beta)$$

where $u(\hat\beta)$ is the vector of first derivatives of the partial log-likelihood function with respect to all the parameters not in the equation evaluated at the MPLE $\hat\beta$, $B_{22}$ is equal to $(A_{22} - A_{21} A_{11}^{-1} A_{12})^{-1}$, and A is defined in “Score Statistic”.


Survival Table


For each stratum, the estimates of the baseline cumulative survival ($S_0$) and hazard ($H_0$) functions and their standard errors are computed. $H_0(t)$ is estimated by

$$\hat H_0(t) = -\ln \hat S_0(t)$$

and the asymptotic variance of $\hat H_0(t)$ is defined in “Estimation of the Baseline Function”. Finally, the cumulative hazard function $H(t|x)$ and survival function $S(t|x)$ are estimated by

$$\hat H(t|x) = \exp(x'\hat\beta)\, \hat H_0(t)$$

and, for a given x,

$$\hat S(t|x) = \left(\hat S_0(t)\right)^{\exp(x'\hat\beta)}$$

The asymptotic variances are

$$\operatorname{var}\left(\hat H(t|x)\right) = \exp(2x'\hat\beta)\, \operatorname{var}\left(\hat H_0(t)\right)$$

and

$$\operatorname{var}\left(\hat S(t|x)\right) = \exp(2x'\hat\beta) \left(\hat S(t|x)\right)^2 \operatorname{var}\left(\hat H_0(t)\right)$$
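
The point estimates in the survival table can be illustrated with a minimal Python sketch. It assumes a single baseline survival value at one time point and plain lists for the covariate pattern and coefficients; the function name `cox_survival_at` is hypothetical and is not part of the procedure.

```python
import math

def cox_survival_at(s0, x, beta):
    """Return (H(t|x), S(t|x)) from a baseline survival value S0(t)."""
    lp = sum(xi * bi for xi, bi in zip(x, beta))  # linear predictor x'beta
    h0 = -math.log(s0)                            # H0(t) = -ln S0(t)
    h = math.exp(lp) * h0                         # H(t|x) = exp(x'beta) H0(t)
    s = s0 ** math.exp(lp)                        # S(t|x) = S0(t)^exp(x'beta)
    return h, s

# Example: a covariate that doubles the hazard squares the baseline survival.
h, s = cox_survival_at(0.8, [1.0], [math.log(2.0)])
```

Both outputs describe the same quantity on different scales, since S(t|x) = exp(-H(t|x)).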


Diagnostic Statistics 


Three casewise diagnostic statistics, Residual, Partial Residual, and DFBETAs, are produced. 
Both Residual and DFBETA are computed for all distinct individuals. Partial Residuals are 
calculated only for uncensored individuals. 

Assume that there are $n_j$ subjects in stratum j and $k_j$ distinct observed events $t_1 < \cdots < t_{k_j}$. Define the selected probability for the lth individual at time $t_i$ as

$$p_l(t_i) = \begin{cases} \dfrac{\exp(x_l'(t_i)\hat\beta)}{\sum_{h \in R_i} w_h \exp(x_h'(t_i)\hat\beta)} & \text{if the } l\text{th individual is in } R_i \\ 0 & \text{otherwise} \end{cases}$$


and 


uy = a, d; [pi(ti) = pi (ti) 
i=l 
1 if jth individual is in D; 
w(t) = 4 otherwise 
hej 
y= ye [yi(ti) aes dipi(ti )] 


i=1 


DFBETA 


The changes in the maximum partial likelihood estimate of beta due to the deletion of a single 
observation have been discussed in Cain and Lange (1984) and Storer and Crowley (1985). The 
estimate of DFBETA computed is derived from augmented regression models. The details can be 
found in Storer and Crowley (1985). When the lth individual in the jth stratum is deleted, the 
change $\Delta\hat\beta_l$ is estimated by 


ABS; = —4]-lyr, 
m 
where 
w= diag(w,. ity ri32) 


my = uy — va! vy 


and $x'(t_i)$ is an $n_{ji} \times p$ matrix which represents the p covariate variables in the model evaluated at $t_i$, and $n_{ji}$ is the number of individuals in $R_{ji}$.


Partial Residuals 


Partial residuals can only be computed for the covariates which are not time dependent. At time $t_i$ in stratum j, $x_g$ is the $p \times 1$ observed covariate vector for any gth individual in set $D_i$, where $D_i$ is the set of individuals dying at $t_i$. The partial residual $\hat r_g$ is defined by

$$\hat r_g = x_g - \frac{\sum_{l \in R_i} w_l x_l \exp(x_l'\hat\beta)}{\sum_{l \in R_i} w_l \exp(x_l'\hat\beta)}$$

Rewriting the above formula in a univariate form, we get

$$\hat r_{gh} = x_{gh} - \frac{\sum_{l \in R_i} w_l x_{lh} \exp(x_l'\hat\beta)}{\sum_{l \in R_i} w_l \exp(x_l'\hat\beta)}, \quad h = 1,\ldots,p, \; g \in D_i$$

where $x_{gh}$ is the hth component for $x_g$. For every variable, the residuals can be plotted against times to test the proportional hazards assumption.


Residuals 


The residuals $e_i$ are computed by

$$e_i = \hat H(t_i|x_i) = \exp(x_i'\hat\beta)\, \hat H_0(t_i)$$

which is the same as the estimate of the cumulative hazard function.


Plots


For a specified pattern, the covariate values $x_c$ are determined and $x_c'\hat\beta$ is computed. There are three plots available for Cox regression.


Survival Plot


For stratum j, $(t_i, \hat S(t_i|x_c))$, $i = 1,\ldots,k_j$, are plotted where

$$\hat S(t_i|x_c) = \left(\hat S_0(t_i)\right)^{\exp(x_c'\hat\beta)}$$

When PATTERN(ALL) is requested, for every uncensored time $t_i$ in stratum j the survival function is estimated by

$$\hat S(t_i) = \frac{\sum_{l=1}^{n_j} w_l \hat S(t_i|x_l)}{\sum_{l=1}^{n_j} w_l} = \frac{\sum_{l=1}^{n_j} w_l \left(\hat S_0(t_i)\right)^{\exp(x_l'\hat\beta)}}{\sum_{l=1}^{n_j} w_l}$$

Then $(t_i, \hat S(t_i))$, $i = 1,\ldots,k_j$, are plotted for stratum j.


Hazard Plot


For stratum j, $(t_i, \hat H(t_i|x_c))$, $i = 1,\ldots,k_j$, are plotted where

$$\hat H(t_i|x_c) = \exp(x_c'\hat\beta)\, \hat H_0(t_i)$$


LML Plot


The log-minus-log plot is used to see whether the stratification variable should be included as a covariate. For stratum j, $(t_i, x_c'\hat\beta + \ln \hat H_0(t_i))$, $i = 1,\ldots,k_j$, are plotted. If the plot shows parallelism among strata, then the stratum variable should be a covariate.


References 
Breslow, N. E. 1974. Covariance analysis of censored survival data. Biometrics, 30, 89-99. 


Cain, K. C., and N. T. Lange. 1984. Approximate case influence for the proportional hazards 
regression model with censored data. Biometrics, 40, 493-499. 


Cox, D. R. 1972. Regression models and life tables (with discussion). Journal of the Royal 
Statistical Society, Series B, 34, 187-220. 


Kalbfleisch, J. D., and R. L. Prentice. 2002. The statistical analysis of failure time data, 2nd ed. 
New York: John Wiley & Sons, Inc. 


Lawless, J. F. 1982. Statistical models and methods for lifetime data. New York: John Wiley & 
Sons, Inc. 


Storer, B. E., and J. Crowley. 1985. A diagnostic for Cox regression and general conditional 
likelihoods. Journal of the American Statistical Association, 80, 139-147. 


CREATE Algorithms 


CREATE produces new series as a function of existing series. 


Notation 
The following notation is used throughout this chapter unless otherwise stated:
Table 22-1
Notation
Notation            Description
Existing Series     $X = \{X_1, \ldots, X_n\}$
New Series          $Y = \{Y_1, \ldots, Y_n\}$


Cumulative Sum (CSUM(X)) 


$$Y_j = \sum_{i=1}^{j} X_i, \quad j = 1,\ldots,n$$


Differences of Order m (DIFF(X,m)) 


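Define

$$Z_j(k) = Z_j(k-1) - Z_{j-1}(k-1), \quad k = 1,\ldots,m; \; j = k+1,\ldots,n$$

with

$$Z_j(0) = X_j, \quad j = 1,\ldots,n$$

then

$$Y_j = \begin{cases} Z_j(m) & j = m+1,\ldots,n \\ \text{SYSMIS} & \text{otherwise} \end{cases}$$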


Lag of Order m (LAG(X,m)) 


$$Y_j = \begin{cases} X_{j-m} & j = m+1,\ldots,n \\ \text{SYSMIS} & j = 1,\ldots,m \end{cases}$$


Lead of Order m (LEAD(X,m)) 


$$Y_j = \begin{cases} X_{j+m} & j = 1,\ldots,n-m \\ \text{SYSMIS} & j = n-m+1,\ldots,n \end{cases}$$
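
The four transformations above can be sketched in Python as follows. This is an illustrative sketch only: `SYSMIS` is represented by `None`, the input series is assumed to contain no missing values, and the function names simply mirror the CREATE keywords.

```python
SYSMIS = None  # stand-in for the system-missing value

def csum(x):
    """Cumulative sum: Y_j = X_1 + ... + X_j."""
    out, total = [], 0
    for v in x:
        total += v
        out.append(total)
    return out

def diff(x, m=1):
    """Differences of order m, applied as m passes of first differencing."""
    z = list(x)
    for k in range(1, m + 1):
        # After pass k, the first k positions have no defined difference.
        z = [SYSMIS if j < k else z[j] - z[j - 1] for j in range(len(z))]
    return z

def lag(x, m=1):
    """Lag of order m: Y_j = X_{j-m}; the first m values are missing."""
    n = len(x)
    return [SYSMIS] * min(m, n) + list(x[:max(n - m, 0)])

def lead(x, m=1):
    """Lead of order m: Y_j = X_{j+m}; the last m values are missing."""
    n = len(x)
    return list(x[m:]) + [SYSMIS] * min(m, n)
```

For example, `diff([1, 4, 9, 16], 2)` leaves the first two positions missing and returns the second differences for the remaining two.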




Moving Average of Length m (MA(X,m))


If m is odd, define

$$q = \frac{m-1}{2}$$

then

$$Y_j = \begin{cases} \sum_{i=-q}^{q} X_{j+i} / m & j = q+1,\ldots,n-q \\ \text{SYSMIS} & \text{otherwise} \end{cases}$$

If m is even, define $q = m/2$ and

$$Z_j = \sum_{k=-q+1}^{q} X_{j+k} / m, \quad j = q,\ldots,n-q$$

then

$$Y_j = \begin{cases} (Z_{j-1} + Z_j)/2 & j = q+1,\ldots,n-q \\ \text{SYSMIS} & \text{otherwise} \end{cases}$$
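
A minimal Python sketch of the centered moving average follows, covering both the odd and the even case. It assumes a complete series held in a list and uses `None` for SYSMIS; the function name `moving_average` is illustrative.

```python
def moving_average(x, m):
    """Centered moving average of length m; None marks SYSMIS."""
    n = len(x)
    y = [None] * n
    if m % 2 == 1:
        q = (m - 1) // 2
        for j in range(q, n - q):          # 0-based position of Y_{j+1}
            y[j] = sum(x[j - q:j + q + 1]) / m
    else:
        q = m // 2
        # z[j] is Z_j for 1-based j = q..n-q: the mean of X_{j-q+1}..X_{j+q}.
        z = {j: sum(x[j - q:j + q]) / m for j in range(q, n - q + 1)}
        for j in range(q + 1, n - q + 1):  # 1-based j
            y[j - 1] = (z[j - 1] + z[j]) / 2
    return y
```

In the even case the two adjacent window means are averaged so that the result is centered on an observed time point.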


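Running Median of Length m (RMED(X,m))


If m is odd,

$$q = \frac{m-1}{2}$$

$$Y_j = \begin{cases} \operatorname{median}(X_{j-q}, \ldots, X_j, \ldots, X_{j+q}) & j = q+1,\ldots,n-q \\ \text{SYSMIS} & \text{otherwise} \end{cases}$$

If m is even, define $q = m/2$ and

$$Z_j = \operatorname{median}(X_{j-q+1}, \ldots, X_j, \ldots, X_{j+q}), \quad j = q,\ldots,n-q$$

then

$$Y_j = \begin{cases} (Z_{j-1} + Z_j)/2 & j = q+1,\ldots,n-q \\ \text{SYSMIS} & \text{otherwise} \end{cases}$$

where

$$\operatorname{median}(a_1, \ldots, a_k) = \begin{cases} a_{(l)} & \text{if } k \text{ is odd} \\ (a_{(l)} + a_{(l+1)})/2 & \text{if } k \text{ is even} \end{cases}$$

$$l = \begin{cases} (k+1)/2 & \text{if } k \text{ is odd} \\ k/2 & \text{if } k \text{ is even} \end{cases}$$

and $a_{(1)} \le a_{(2)} \le \cdots \le a_{(k)}$ is the ordered sample of $a_1, \ldots, a_k$.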
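
The running median can be sketched in the same way. This illustration relies on Python's `statistics.median`, which averages the two middle order statistics when the window length is even, matching the definition of the median given above. `None` stands for SYSMIS and the function name is illustrative.

```python
from statistics import median

def running_median(x, m):
    """Running median of length m; None marks SYSMIS."""
    n = len(x)
    y = [None] * n
    q = m // 2  # (m - 1) / 2 for odd m, m / 2 for even m
    if m % 2 == 1:
        for j in range(q, n - q):
            y[j] = median(x[j - q:j + q + 1])
    else:
        # z[j] is Z_j for 1-based j = q..n-q: the median of X_{j-q+1}..X_{j+q}.
        z = {j: median(x[j - q:j + q]) for j in range(q, n - q + 1)}
        for j in range(q + 1, n - q + 1):
            y[j - 1] = (z[j - 1] + z[j]) / 2
    return y
```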


Seasonal Differencing of Order m and Period p 
(SDIFF(X,m,p)) 


Define 


$$Z_j(k) = Z_j(k-1) - Z_{j-p}(k-1), \quad k = 1,\ldots,m; \; j = pk+1,\ldots,n$$

then

$$Y_j = Z_j(m), \quad j = mp+1,\ldots,n$$


The T4253H Smoothing Function (T4253H(X)) 


The original series is smoothed by a compound data smoother based on Velleman (1980). The smoother starts with:

- A running median of 4:
Let Z be the smoothed series, then

$$Z_{j+1/2} = \operatorname{median}(X_{j-1}, X_j, X_{j+1}, X_{j+2}), \quad j = 2,\ldots,n-2$$
and 


ZY) =e zy? = median( X1, X2) = $(Xq + X92) 


(1) 
n+1/2 


Ft median( Xn-1, Xn) = 4(Xn-1 +Xn) Z 


n—-1/2 


= Xn 
- A running median of 2:
(1) r7(1) 
Z; = 20.5 Zy = Ly, | 1/2 
and 


Gere A eT po) i pee eee | 


(1) (1) 


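- A running median of 5 on $Z_1^{(1)}, \ldots, Z_n^{(1)}$ from the previous step:

Let $Z^{(2)}$ be the resulting series, then

$$Z_1^{(2)} = Z_1^{(1)}, \qquad Z_n^{(2)} = Z_n^{(1)}$$

$$Z_2^{(2)} = \operatorname{median}\left(Z_1^{(1)}, Z_2^{(1)}, Z_3^{(1)}\right)$$

$$Z_{n-1}^{(2)} = \operatorname{median}\left(Z_{n-2}^{(1)}, Z_{n-1}^{(1)}, Z_n^{(1)}\right)$$

and

$$Z_j^{(2)} = \operatorname{median}\left(Z_{j-2}^{(1)}, Z_{j-1}^{(1)}, Z_j^{(1)}, Z_{j+1}^{(1)}, Z_{j+2}^{(1)}\right), \quad j = 3,\ldots,n-2$$

- A running median of 3 on $Z_1^{(2)}, \ldots, Z_n^{(2)}$ from the previous step:

Let $Z^{(3)}$ be the resulting series, then

$$Z_j^{(3)} = \operatorname{median}\left(Z_{j-1}^{(2)}, Z_j^{(2)}, Z_{j+1}^{(2)}\right), \quad j = 2,\ldots,n-1$$

$$Z_1^{(3)} = \operatorname{median}\left(3Z_2^{(3)} - 2Z_3^{(3)},\; Z_1^{(2)},\; Z_2^{(3)}\right)$$

$$Z_n^{(3)} = \operatorname{median}\left(3Z_{n-1}^{(3)} - 2Z_{n-2}^{(3)},\; Z_n^{(2)},\; Z_{n-1}^{(3)}\right)$$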


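- Hanning (Running Weighted Averages):

$$Z_j^{(4)} = \tfrac14 Z_{j-1}^{(3)} + \tfrac12 Z_j^{(3)} + \tfrac14 Z_{j+1}^{(3)}, \quad j = 2,\ldots,n-1$$

$$Z_1^{(4)} = Z_1^{(3)}, \qquad Z_n^{(4)} = Z_n^{(3)}$$

- Residual:

$$D_i = X_i - Z_i^{(4)}, \quad i = 1,\ldots,n$$

- Repeat the previous steps on the residuals $D_1, \ldots, D_n$:

- Let $D_1^{(4)}, \ldots, D_n^{(4)}$ be the final result.

- Final smooth:

$$Y_i = Z_i^{(4)} + D_i^{(4)}, \quad i = 1,\ldots,n$$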


Prior Moving Averages of Length m (PMA(X,m)) 


$$Y_j = \begin{cases} \sum_{i=j-m}^{j-1} X_i / m & j = m+1,\ldots,n \\ \text{SYSMIS} & j = 1,\ldots,m \end{cases}$$
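

Fast Fourier Transform (FFT(X))


The discrete Fourier transform of a sequence $X = \{X_1, \ldots, X_n\}$ is defined as

$$\begin{aligned} Y_k &= \frac{1}{n} \sum_{t=1}^{n} X_t \exp\{-i 2\pi f_k (t-1)\} \\ &= \frac{1}{n} \sum_{t=1}^{n} X_t \left[\cos(2\pi f_k (t-1)) - i \sin(2\pi f_k (t-1))\right] \\ &= \frac{1}{n} \sum_{t=1}^{n} X_t \cos(2\pi f_k (t-1)) - i\, \frac{1}{n} \sum_{t=1}^{n} X_t \sin(2\pi f_k (t-1)) \\ &= a_k + i b_k \end{aligned}$$

Thus a, b are two sequences generated by FFT and they are called real and imaginary, respectively.

$$a_k = \frac{1}{n} \sum_{t=1}^{n} X_t \cos(2\pi f_k (t-1)), \quad k = 1,\ldots,q$$

$$b_k = -\frac{1}{n} \sum_{t=1}^{n} X_t \sin(2\pi f_k (t-1)), \quad k = 1,\ldots,q$$

where

$$f_k = \frac{k}{n}, \qquad q = \begin{cases} (n-1)/2 & \text{if } n \text{ is odd} \\ n/2 - 1 & \text{if } n \text{ is even} \end{cases}$$

and

$$a_0 = \frac{1}{n} \sum_{t=1}^{n} X_t, \qquad b_0 = \begin{cases} 0 & \text{if } n \text{ is odd} \\ \frac{1}{n} \sum_{t=1}^{n} X_t \cos(\pi (t-1)) & \text{if } n \text{ is even} \end{cases}$$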


Inverse Fast Fourier Transform of Two Series (IFFT(a,b)) 


The inverse Fourier Transform of two series {a, b} is defined as

$$X_t = a_0 + b_0 \cos(\pi (t-1)) + 2 \sum_{k=1}^{q} \left[ a_k \cos(2\pi f_k (t-1)) - b_k \sin(2\pi f_k (t-1)) \right]$$


References 
Brigham, E. O. 1974. The fast Fourier transform. Englewood Cliffs, N.J.: Prentice-Hall. 


Monro, D. M. 1975. Algorithm AS 83: Complex discrete fast Fourier transform. Applied 
Statistics, 24, 153-160. 


Monro, D. M., and J. L. Branch. 1977. Algorithm AS 117: The Chirp discrete Fourier transform 
of general length. Applied Statistics, 26, 351-361. 


Velleman, P. F. 1980. Definition and comparison of robust nonlinear data smoothing algorithms. 
Journal of the American Statistical Association, 75, 609-615. 


Velleman, P. F., and D. C. Hoaglin. 1981. Applications, basics, and computing of exploratory 
data analysis. Boston, Mass.: Duxbury Press. 


CROSSTABS Algorithms 


The notation and statistics refer to bivariate subtables defined by a row variable X and a column variable Y, unless specified otherwise. By default, CROSSTABS deletes cases with missing values on a table-by-table basis.


Notation
The following notation is used throughout this section unless otherwise stated:
Table 23-1
Notation
Notation    Description
$X_i$       Distinct values of row variable arranged in ascending order: $X_1 < X_2 < \cdots < X_R$
$Y_j$       Distinct values of column variable arranged in ascending order: $Y_1 < Y_2 < \cdots < Y_C$
$f_{ij}$    Sum of cell weights for cases in cell (i, j)
$c_j$       $\sum_{i=1}^{R} f_{ij}$, the jth column subtotal
$r_i$       $\sum_{j=1}^{C} f_{ij}$, the ith row subtotal
$W$         $\sum_{j=1}^{C} c_j = \sum_{i=1}^{R} r_i$, the grand total

Marginal and Cell Statistics 


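Count

$$\text{count} = f_{ij}$$

Expected Count

$$E_{ij} = \frac{r_i c_j}{W}$$

Row Percent

$$\text{row percent} = 100 \times (f_{ij}/r_i)$$

Column Percent

$$\text{column percent} = 100 \times (f_{ij}/c_j)$$

Total Percent

$$\text{total percent} = 100 \times (f_{ij}/W)$$

Residual

$$R_{ij} = f_{ij} - E_{ij}$$

Standardized Residual

$$SR_{ij} = \frac{R_{ij}}{\sqrt{E_{ij}}}$$

Adjusted Residual

$$AR_{ij} = \frac{R_{ij}}{\sqrt{E_{ij}\left(1 - \frac{r_i}{W}\right)\left(1 - \frac{c_j}{W}\right)}}$$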
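
The cell statistics up to the standardized residual can be illustrated with a short Python sketch. It assumes the table is a list of rows of cell weights with all marginal totals positive; the function name and dictionary keys are illustrative.

```python
import math

def cell_statistics(f):
    """Per-cell statistics for a table f[i][j] of cell weights."""
    r = [sum(row) for row in f]           # row subtotals r_i
    c = [sum(col) for col in zip(*f)]     # column subtotals c_j
    W = sum(r)                            # grand total
    stats = {}
    for i, row in enumerate(f):
        for j, fij in enumerate(row):
            e = r[i] * c[j] / W           # expected count E_ij
            stats[(i, j)] = {
                "expected": e,
                "row_pct": 100 * fij / r[i],
                "col_pct": 100 * fij / c[j],
                "total_pct": 100 * fij / W,
                "residual": fij - e,
                "std_residual": (fij - e) / math.sqrt(e),
            }
    return stats
```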


Chi-Square Statistics


Pearson’s Chi-Square

$$\chi_P^2 = \sum_{i,j} \frac{(f_{ij} - E_{ij})^2}{E_{ij}}$$

The degrees of freedom are (R - 1)(C - 1).


Likelihood Ratio

$$\chi_{LR}^2 = 2 \sum_{i,j} f_{ij} \ln(f_{ij}/E_{ij})$$

The degrees of freedom are (R - 1)(C - 1).

Note: when $f_{ij} = 0$, the entire term $f_{ij} \ln(f_{ij}/E_{ij})$ is treated as 0, because $\lim_{f \to 0} f \ln(f/E) = 0$, 
and thus has no effect on the sum. 
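
As an illustration, both chi-square statistics can be computed in one pass over the table. The sketch below assumes all marginal totals are positive, so that every expected count is nonzero, and skips zero cells in the likelihood-ratio sum as described in the note. The function name is illustrative.

```python
import math

def chi_square_statistics(f):
    """Return (Pearson chi-square, likelihood-ratio chi-square, df)."""
    r = [sum(row) for row in f]
    c = [sum(col) for col in zip(*f)]
    W = sum(r)
    pearson = lr = 0.0
    for i, row in enumerate(f):
        for j, fij in enumerate(row):
            e = r[i] * c[j] / W
            pearson += (fij - e) ** 2 / e
            if fij > 0:  # a zero cell contributes nothing to the LR sum
                lr += fij * math.log(fij / e)
    df = (len(r) - 1) * (len(c) - 1)
    return pearson, 2 * lr, df
```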
Fisher’s Exact Test 


If the table is a 2 x 2 table, not resulting from a larger table with missing cells, with at least one 
expected cell count less than 5, then the Fisher exact test is calculated. For more information, see 
the topic “Significance Levels for Fisher’s Exact Test”. 


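Yates Continuity Corrected for 2 x 2 Tables

$$\chi_c^2 = \begin{cases} \dfrac{W\left(|f_{11}f_{22} - f_{12}f_{21}| - 0.5W\right)^2}{r_1 r_2 c_1 c_2} & \text{if } |f_{11}f_{22} - f_{12}f_{21}| > 0.5W \\ 0 & \text{otherwise} \end{cases}$$

The degrees of freedom are 1.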


Mantel-Haenszel Test of Linear Association

$$\chi_{MH}^2 = (W - 1) r^2$$

where r is the Pearson correlation coefficient to be defined later. The degrees of freedom are 1.


Other Measures of Association 
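Phi Coefficient


For a table not 2 x 2

$$\varphi = \sqrt{\frac{\chi_P^2}{W}}$$

For a 2 x 2 table only, $\varphi$ is equal to the Pearson correlation coefficient so that the sign of $\varphi$ matches that of the correlation coefficients.


Coefficient of Contingency

$$CC = \left(\frac{\chi_P^2}{\chi_P^2 + W}\right)^{1/2}$$


Cramér’s V

$$V = \left(\frac{\chi_P^2}{W(q - 1)}\right)^{1/2}$$

where q = min{R,C}.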
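
The three chi-square-based measures can be sketched as a single helper. This is an illustration that takes the Pearson chi-square as given; the contingency coefficient is written in its usual textbook form, and `phi` here is the unsigned value used for tables that are not 2 x 2. The function name is hypothetical.

```python
import math

def nominal_measures(chi2, W, R, C):
    """Return (phi, contingency coefficient, Cramer's V)."""
    q = min(R, C)
    phi = math.sqrt(chi2 / W)
    cc = math.sqrt(chi2 / (chi2 + W))
    v = math.sqrt(chi2 / (W * (q - 1)))
    return phi, cc, v
```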


Measures of Proportional Reduction in Predictive Error 
Lambda 


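Let $f_{im}$ and $f_{mj}$ be the largest cell count in row i and column j, respectively. Also, let $r_m$ be the largest row subtotal and $c_m$ the largest column subtotal. Define $\lambda_{Y|X}$ as the proportion of relative error in predicting an individual’s Y category that can be eliminated by knowledge of the X category. $\lambda_{Y|X}$ is computed as

$$\lambda_{Y|X} = \frac{\sum_{i=1}^{R} f_{im} - c_m}{W - c_m}$$

The standard errors are

$$ASE_0 = \frac{\sqrt{\sum_{i,j} f_{ij}\left(\delta_{ij}^{c} - \delta_{j}^{c}\right)^2 - \left(\sum_{i=1}^{R} f_{im} - c_m\right)^2 / W}}{W - c_m}$$

$$ASE_1 = \frac{\sqrt{\sum_{i,j} f_{ij}\left(\delta_{ij}^{c} - \delta_{j}^{c} + \lambda_{Y|X}\,\delta_{j}^{c}\right)^2 - W\lambda_{Y|X}^2}}{W - c_m}$$

where

$$\delta_{ij}^{c} = \begin{cases} 1 & \text{if } j \text{ is column index for } f_{im} \\ 0 & \text{otherwise} \end{cases}$$

$$\delta_{j}^{c} = \begin{cases} 1 & \text{if } j \text{ is index for } c_m \\ 0 & \text{otherwise} \end{cases}$$

Lambda for predicting X from Y, $\lambda_{X|Y}$, is obtained by permuting the indices in the above formulae.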


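The two asymmetric lambdas are averaged to obtain the symmetric lambda.

$$\lambda = \frac{\sum_{i=1}^{R} f_{im} + \sum_{j=1}^{C} f_{mj} - c_m - r_m}{2W - r_m - c_m}$$

The standard errors are

$$ASE_0 = \frac{\sqrt{\sum_{i,j} f_{ij}\left(\delta_{ij}^{r} + \delta_{ij}^{c} - \delta_{i}^{r} - \delta_{j}^{c}\right)^2 - \left(\sum_{i=1}^{R} f_{im} + \sum_{j=1}^{C} f_{mj} - c_m - r_m\right)^2 / W}}{2W - r_m - c_m}$$

$$ASE_1 = \frac{\sqrt{\sum_{i,j} f_{ij}\left(\delta_{ij}^{r} + \delta_{ij}^{c} - \delta_{i}^{r} - \delta_{j}^{c} + \lambda\left(\delta_{i}^{r} + \delta_{j}^{c}\right)\right)^2 - 4W\lambda^2}}{2W - r_m - c_m}$$

where

$$\delta_{ij}^{r} = \begin{cases} 1 & \text{if } i \text{ is row index for } f_{mj} \\ 0 & \text{otherwise} \end{cases} \qquad \delta_{i}^{r} = \begin{cases} 1 & \text{if } i \text{ is index for } r_m \\ 0 & \text{otherwise} \end{cases}$$

and where

$$\delta_{ij}^{c} = \begin{cases} 1 & \text{if } j \text{ is column index for } f_{im} \\ 0 & \text{otherwise} \end{cases} \qquad \delta_{j}^{c} = \begin{cases} 1 & \text{if } j \text{ is index for } c_m \\ 0 & \text{otherwise} \end{cases}$$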
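
The point estimates of the lambda measures can be illustrated with the sketch below. It computes only the coefficients, not their standard errors, and assumes that the largest column and row subtotals are smaller than W so the denominators are nonzero. The function name is illustrative.

```python
def lambdas(f):
    """Return (lambda Y|X, lambda X|Y, symmetric lambda) for table f."""
    r = [sum(row) for row in f]
    c = [sum(col) for col in zip(*f)]
    W = sum(r)
    row_max = sum(max(row) for row in f)        # sum over i of f_im
    col_max = sum(max(col) for col in zip(*f))  # sum over j of f_mj
    rm, cm = max(r), max(c)
    lam_yx = (row_max - cm) / (W - cm)
    lam_xy = (col_max - rm) / (W - rm)
    lam_sym = (row_max + col_max - cm - rm) / (2 * W - rm - cm)
    return lam_yx, lam_xy, lam_sym
```

The symmetric value is a weighted rather than a simple average of the two asymmetric values, because numerators and denominators are pooled separately.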


Goodman and Kruskal’s Tau 


Similarly defined is Goodman and Kruskal’s tau ($\tau$) (Goodman and Kruskal, 1954):

$$\tau_{Y|X} = \frac{W \sum_{i,j} f_{ij}^2 / r_i - \sum_{j=1}^{C} c_j^2}{W^2 - \sum_{j=1}^{C} c_j^2}$$

pe fer 1 ; 
ASE, = |43- fij§ (v- 8) fist c; | -W6 = pay — fj 
inj ‘j=1 i j=1 i 


in which 
C - C 
sw and o WL fh Dod 
j=l j=l 


$\tau_{X|Y}$ and its standard error can be obtained by interchanging the roles of X and Y.

The significance level is based on the chi-square distribution, since

$$(W - 1)(C - 1)\,\tau_{Y|X} \sim \chi^2_{(R-1)(C-1)}$$

$$(W - 1)(R - 1)\,\tau_{X|Y} \sim \chi^2_{(R-1)(C-1)}$$


Uncertainty Coefficient 


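Let $U_{Y|X}$ be the proportional reduction in the uncertainty (entropy) of Y that can be eliminated by knowledge of X. It is computed as

$$U_{Y|X} = \frac{U(X) + U(Y) - U(XY)}{U(Y)}$$

where

$$U(X) = -\sum_{i} \frac{r_i}{W} \ln\left(\frac{r_i}{W}\right), \qquad U(Y) = -\sum_{j} \frac{c_j}{W} \ln\left(\frac{c_j}{W}\right)$$

and

$$U(XY) = -\sum_{i,j} \frac{f_{ij}}{W} \ln\left(\frac{f_{ij}}{W}\right), \quad \text{for } f_{ij} > 0$$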


The asymptotic standard errors are 


: 2 


~ /P-W[U(X)+U(Y)-U(XY)/? 
AS Ey = : 


[WOU(Y)] 
where 


r=) delat) 


The formulas for $U_{X|Y}$ can be obtained by interchanging the roles of X and Y.

A symmetric version of the two asymmetric uncertainty coefficients is defined as follows:

$$U = \frac{2\left[U(X) + U(Y) - U(XY)\right]}{U(X) + U(Y)}$$


with asymptotic standard errors 


. 2 "yy PiCj 1Y¥) LINY Ji, 
ASE, = WX Sai{ven yin (=F) —[U(X)+U(Y)]|In (Fe 
iJ 


\ = 
ASE, = aaa? — MIU x)+ Ut) U(x YI} 
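
The uncertainty coefficients themselves (not their standard errors) can be sketched from the three entropies. This is an illustration that assumes natural logarithms and the usual entropy definitions, with zero counts skipped; the function names are hypothetical.

```python
import math

def _entropy(counts, W):
    """Entropy of a set of counts; zero counts contribute nothing."""
    return -sum(n / W * math.log(n / W) for n in counts if n > 0)

def uncertainty_coefficients(f):
    """Return (U for Y given X, U for X given Y, symmetric U)."""
    r = [sum(row) for row in f]
    c = [sum(col) for col in zip(*f)]
    W = sum(r)
    ux = _entropy(r, W)
    uy = _entropy(c, W)
    uxy = _entropy([fij for row in f for fij in row], W)
    shared = ux + uy - uxy  # U(X) + U(Y) - U(XY)
    return shared / uy, shared / ux, 2 * shared / (ux + uy)
```

All three values are 0 for an independent table and 1 when each row determines the column exactly.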


Cohen’s Kappa 


Cohen’s kappa ($\kappa$), defined only for square tables (R = C), is computed as

$$\kappa = \frac{W \sum_{i=1}^{R} f_{ii} - \sum_{i=1}^{R} r_i c_i}{W^2 - \sum_{i=1}^{R} r_i c_i}$$

with variance 


igen NE Are ee eee) 


(w2- Uric)” 
(w- Dfi)? Daze bey) = Were)" 
a. 


(w2- Ur;c;) 


vary = L 5 [ (ne bs rs] -w(S rici(rit a) 
(we Sone] a i i 
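
As an illustration of the kappa coefficient (point estimate only), the sketch below uses the standard form, observed agreement minus chance agreement over one minus chance agreement, written in terms of cell weights. It assumes a square table; the function name is illustrative.

```python
def cohen_kappa(f):
    """Cohen's kappa for a square table f of cell weights."""
    r = [sum(row) for row in f]
    c = [sum(col) for col in zip(*f)]
    W = sum(r)
    diag = sum(f[i][i] for i in range(len(f)))          # sum of f_ii
    chance = sum(r[i] * c[i] for i in range(len(f)))    # sum of r_i c_i
    return (W * diag - chance) / (W * W - chance)
```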


Kendall’s Tau-b and Tau-c 


| 


Define

$$D_r = W^2 - \sum_{i=1}^{R} r_i^2$$

$$D_c = W^2 - \sum_{j=1}^{C} c_j^2$$

$$C_{ij} = \sum_{h < i} \sum_{k < j} f_{hk} + \sum_{h > i} \sum_{k > j} f_{hk}$$

$$D_{ij} = \sum_{h < i} \sum_{k > j} f_{hk} + \sum_{h > i} \sum_{k < j} f_{hk}$$

$$P = \sum_{i,j} f_{ij} C_{ij}$$

$$Q = \sum_{i,j} f_{ij} D_{ij}$$

Note: the P and Q listed above are double the “usual” P (number of concordant pairs) and Q (number of discordant pairs). Likewise, $D_r$ is double the “usual” $P + Q + X_0$ (the number of concordant pairs, discordant pairs, and pairs on which the row variable is tied) and $D_c$ is double the “usual” $P + Q + Y_0$ (the number of concordant pairs, discordant pairs, and pairs on which the column variable is tied).


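Kendall’s Tau-b

$$\tau_b = \frac{P - Q}{\sqrt{D_r D_c}}$$

with standard error

$$ASE_1 = \frac{1}{D_r D_c} \sqrt{\sum_{i,j} f_{ij}\left(2\sqrt{D_r D_c}\,(C_{ij} - D_{ij}) + \tau_b \nu_{ij}\right)^2 - W^3 \tau_b^2 (D_r + D_c)^2}$$

where

$$\nu_{ij} = r_i D_c + c_j D_r$$

Under the independence assumption, the standard error is

$$ASE_0 = 2\sqrt{\frac{\sum_{i,j} f_{ij} (C_{ij} - D_{ij})^2 - \frac{1}{W}(P - Q)^2}{D_r D_c}}$$


Kendall’s Tau-c

$$\tau_c = \frac{q(P - Q)}{W^2 (q - 1)}$$

with standard error

$$ASE_1 = \frac{2q}{(q - 1)W^2} \sqrt{\sum_{i,j} f_{ij} (C_{ij} - D_{ij})^2 - \frac{1}{W}(P - Q)^2}$$

or, under the independence assumption,

$$ASE_0 = \frac{2q}{(q - 1)W^2} \sqrt{\sum_{i,j} f_{ij} (C_{ij} - D_{ij})^2 - \frac{1}{W}(P - Q)^2}$$

where

$$q = \min\{R, C\}$$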


Gamma


Gamma ($\gamma$) is estimated by

$$\gamma = \frac{P - Q}{P + Q}$$

with standard error

$$ASE_1 = \frac{4}{(P + Q)^2} \sqrt{\sum_{i,j} f_{ij} (Q C_{ij} - P D_{ij})^2}$$

or, under the hypothesis of independence,

$$ASE_0 = \frac{2}{P + Q} \sqrt{\sum_{i,j} f_{ij} (C_{ij} - D_{ij})^2 - \frac{1}{W}(P - Q)^2}$$


Somers’ d


Somers’ d with row variable X as the independent variable is calculated as

$$d_{Y|X} = \frac{P - Q}{D_r}$$

with standard error

$$ASE_1 = \frac{2}{D_r^2} \sqrt{\sum_{i,j} f_{ij} \left\{D_r (C_{ij} - D_{ij}) - (P - Q)(W - r_i)\right\}^2}$$

or, under the hypothesis of independence,

$$ASE_0 = \frac{2}{D_r} \sqrt{\sum_{i,j} f_{ij} (C_{ij} - D_{ij})^2 - \frac{1}{W}(P - Q)^2}$$

By interchanging the roles of X and Y, the formulas for Somers’ d with X as the dependent variable can be obtained.

Symmetric version of Somers’ d is

$$d = \frac{P - Q}{\frac{1}{2}(D_c + D_r)}$$

The standard error is

$$ASE_1 = \frac{2\sigma_{\tau_b}}{D_c + D_r} \sqrt{D_r D_c}$$

where $\sigma^2_{\tau_b}$ is the variance of Kendall’s $\tau_b$,

$$ASE_0 = \frac{4}{D_c + D_r} \sqrt{\sum_{i,j} f_{ij} (C_{ij} - D_{ij})^2 - \frac{1}{W}(P - Q)^2}$$
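
The ordinal measures share the quantities P, Q, $D_r$, and $D_c$, so their point estimates can be illustrated together. The sketch below uses the doubled P and Q described in the note above, computes no standard errors, and assumes a table with at least one concordant or discordant pair. Function and key names are illustrative.

```python
import math

def ordinal_measures(f):
    """Point estimates of tau-b, tau-c, gamma and Somers' d."""
    R, C = len(f), len(f[0])
    r = [sum(row) for row in f]
    c = [sum(col) for col in zip(*f)]
    W = sum(r)
    P = Q = 0
    for i in range(R):
        for j in range(C):
            conc = disc = 0
            for h in range(R):
                for k in range(C):
                    if (h < i and k < j) or (h > i and k > j):
                        conc += f[h][k]      # contributes to C_ij
                    elif (h < i and k > j) or (h > i and k < j):
                        disc += f[h][k]      # contributes to D_ij
            P += f[i][j] * conc
            Q += f[i][j] * disc
    Dr = W ** 2 - sum(v ** 2 for v in r)
    Dc = W ** 2 - sum(v ** 2 for v in c)
    q = min(R, C)
    return {
        "P": P, "Q": Q,
        "tau_b": (P - Q) / math.sqrt(Dr * Dc),
        "tau_c": q * (P - Q) / (W ** 2 * (q - 1)),
        "gamma": (P - Q) / (P + Q),
        "d_yx": (P - Q) / Dr,
        "d_sym": (P - Q) / (0.5 * (Dr + Dc)),
    }
```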


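Pearson’s r


The Pearson’s product moment correlation r is computed as

$$r = \frac{\operatorname{cov}(X,Y)}{\sqrt{S(X)\,S(Y)}} \equiv \frac{S}{T}$$

where

$$\operatorname{cov}(X,Y) = \sum_{i=1}^{R} \sum_{j=1}^{C} X_i Y_j f_{ij} - \left(\sum_{i=1}^{R} X_i r_i\right)\left(\sum_{j=1}^{C} Y_j c_j\right) / W$$

$$S(X) = \sum_{i=1}^{R} X_i^2 r_i - \left(\sum_{i=1}^{R} X_i r_i\right)^2 / W$$

and

$$S(Y) = \sum_{j=1}^{C} Y_j^2 c_j - \left(\sum_{j=1}^{C} Y_j c_j\right)^2 / W$$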


The variance of r is 


ae = 2. 
var, = ADIT (MENS Y) ae, [es x) SY) (Gav) sx 


If the null hypothesis is true, 


De fa XPYP — | fa Xi%s | /W 
a ij 


. tJ 
varp = 
where

$$\bar X = \sum_{i=1}^{R} X_i r_i / W$$

and

$$\bar Y = \sum_{j=1}^{C} Y_j c_j / W$$

$$t = \frac{r\sqrt{W - 2}}{\sqrt{1 - r^2}}$$

is distributed as a t with W - 2 degrees of freedom.


Spearman Correlation 


The Spearman’s rank correlation coefficient $r_s$ is computed by using rank scores $R_i$ for $X_i$ and $C_j$ for $Y_j$. These rank scores are defined as follows:

$$R_i = \sum_{k < i} r_k + (r_i + 1)/2 \quad \text{for } i = 1, 2, \ldots, R$$

$$C_j = \sum_{h < j} c_h + (c_j + 1)/2 \quad \text{for } j = 1, 2, \ldots, C$$

The formulas for $r_s$ and its asymptotic variance can be obtained from the Pearson formulas by substituting $R_i$ and $C_j$ for $X_i$ and $Y_j$, respectively.


Eta 


Asymmetric $\eta$ with the column variable Y as dependent is

$$\eta_Y = \sqrt{1 - \frac{S_{YW}}{S(Y)}}$$

where

$$S_{YW} = \sum_{i,j} Y_j^2 f_{ij} - \sum_{i=1}^{R} \left(\sum_{j=1}^{C} Y_j f_{ij}\right)^2 / r_i$$


Relative Risk


Consider a 2 x 2 table (that is, R = C = 2). In a case-control study, the relative risk is estimated as

$$R_0 = \frac{f_{11} f_{22}}{f_{12} f_{21}}$$

The $100(1 - \alpha)$ percent CI for the relative risk is obtained as

$$\left[R_0 \exp(-z_{1-\alpha/2}\, v),\; R_0 \exp(z_{1-\alpha/2}\, v)\right]$$

where

$$v = \left(\frac{1}{f_{11}} + \frac{1}{f_{12}} + \frac{1}{f_{21}} + \frac{1}{f_{22}}\right)^{1/2}$$

The relative risk ratios in a cohort study are computed for both columns. For column 1, the risk is

$$R_1 = \frac{f_{11}(f_{21} + f_{22})}{f_{21}(f_{11} + f_{12})}$$

and the corresponding $100(1 - \alpha)$ percent CI is

$$\left[R_1 \exp(-z_{1-\alpha/2}\, v),\; R_1 \exp(z_{1-\alpha/2}\, v)\right]$$

where

$$v = \left(\frac{f_{12}}{f_{11}(f_{11} + f_{12})} + \frac{f_{22}}{f_{21}(f_{21} + f_{22})}\right)^{1/2}$$

The relative risk for column 2 and the confidence interval are computed similarly.
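
Both risk estimates and their intervals can be sketched with the standard library alone. The sketch assumes all four cell counts are positive and uses `statistics.NormalDist` for the normal quantile; the function names are illustrative.

```python
import math
from statistics import NormalDist

def case_control_risk(f11, f12, f21, f22, alpha=0.05):
    """Odds-ratio estimate R0 with its 100(1 - alpha) percent CI."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    r0 = f11 * f22 / (f12 * f21)
    v = math.sqrt(1 / f11 + 1 / f12 + 1 / f21 + 1 / f22)
    return r0, r0 * math.exp(-z * v), r0 * math.exp(z * v)

def cohort_risk_col1(f11, f12, f21, f22, alpha=0.05):
    """Column 1 risk ratio R1 with its 100(1 - alpha) percent CI."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    r1 = f11 * (f21 + f22) / (f21 * (f11 + f12))
    v = math.sqrt(f12 / (f11 * (f11 + f12)) + f22 / (f21 * (f21 + f22)))
    return r1, r1 * math.exp(-z * v), r1 * math.exp(z * v)
```

Because each interval is built on the log scale, the estimate is the geometric mean of its two limits.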


McNemar-Bowker’s Test 


This statistic is used to test if a square table is symmetric. 


Notations
Table 23-2
Notation
Notation    Description
n           Dimension of the table (both row and column)
$p_{ij}$    Unknown population cell probability of row i and column j
$n_{ij}$    Observed cell count of row i and column j


Algorithm


Given an n x n square table, the McNemar-Bowker statistic is used to test the hypothesis $H_0: p_{ij} = p_{ji}$ for all (i < j) vs. $H_1: p_{ij} \ne p_{ji}$ for at least one pair (i, j). The statistic is defined by the formula

$$\chi^2 = \sum_{i < j} I(n_{ij} + n_{ji} > 0)\, \frac{(n_{ij} - n_{ji})^2}{n_{ij} + n_{ji}}$$

where I() is the indicator function. Under the null hypothesis, $\chi^2$ has an asymptotic chi-square distribution with n(n - 1)/2 degrees of freedom. The null hypothesis is rejected if $\chi^2$ has a large value. The two-sided p-value is equal to $1 - F(n(n-1)/2, \chi^2)$, where $F(df, \chi^2)$ is the CDF of the chi-square distribution with df degrees of freedom.


A Special Case: 2x2 Tables


For a 2x2 table, the statistic reduces to the classical McNemar statistic (McNemar, 1947), for which an exact p-value can be computed. The two-tailed probability level is

$$2 \sum_{i=0}^{\min(n_{12}, n_{21})} \binom{n_{12} + n_{21}}{i} \left(\frac{1}{2}\right)^{n_{12} + n_{21}}$$
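
The statistic and the exact 2x2 probability can be sketched as follows. Two assumptions are made in this illustration: the counts passed to the exact test are integers, and the doubled tail probability is capped at 1, which the formula above does not state explicitly. Function names are illustrative.

```python
from math import comb

def mcnemar_bowker(n):
    """Return (statistic, degrees of freedom) for a square table n."""
    size = len(n)
    stat = 0.0
    for i in range(size):
        for j in range(i + 1, size):
            s = n[i][j] + n[j][i]
            if s > 0:  # indicator I(n_ij + n_ji > 0)
                stat += (n[i][j] - n[j][i]) ** 2 / s
    return stat, size * (size - 1) // 2

def mcnemar_exact_p(n12, n21):
    """Two-tailed exact probability for the 2x2 case (integer counts)."""
    total = n12 + n21
    tail = sum(comb(total, i) for i in range(min(n12, n21) + 1))
    return min(1.0, 2 * tail * 0.5 ** total)  # cap at 1 is an assumption
```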


Conditional Independence and Homogeneity 


The Cochran and Mantel-Haenszel statistics test the independence of two dichotomous variables, controlling for one or more other categorical variables. These “other” categorical variables define a number of strata, across which these statistics are computed.

The Breslow-Day statistic is used to test homogeneity of the common odds ratio, which is a weaker condition than the conditional independence (i.e., homogeneity with the common odds ratio of 1) tested by the Cochran and Mantel-Haenszel statistics. Tarone’s statistic is the Breslow-Day statistic adjusted for a consistent but inefficient estimator, such as the Mantel-Haenszel estimator of the common odds ratio.


Notation and Definitions 


Table 23-3
Notation
Notation     Description
K            The number of strata.
$f_{ijk}$    Sum of cell weights for cases in the ith row of the jth column of the kth stratum.
$c_{jk}$     $\sum_{i=1}^{R} f_{ijk}$, the jth column subtotal of the kth stratum.
$r_{ik}$     $\sum_{j=1}^{C} f_{ijk}$, the ith row subtotal of the kth stratum.
$n_k$        $\sum_{j=1}^{C} c_{jk} = \sum_{i=1}^{R} r_{ik}$, the grand total of the kth stratum.
$E_{ijk}$    $E(f_{ijk}) = \dfrac{r_{ik} c_{jk}}{n_k}$, the expected cell count of the ith row of the jth column of the kth stratum.


A stratum such that $n_k = 0$ is omitted from the analysis. (K must be modified accordingly.) If $n_k = 0$ for all k, then no computation is done.


Preliminarily, define for each k

$$\hat p_{ik} = \frac{f_{i1k}}{r_{ik}}, \qquad d_k = \hat p_{1k} - \hat p_{2k}, \qquad \hat p_k = \frac{c_{1k}}{n_k}$$

and

$$w_k = \frac{r_{1k} r_{2k}}{n_k}$$


Cochran’s Statistic 


Cochran’s statistic (Cochran, 1954) is

$$C = \frac{\sum_{k=1}^{K} w_k d_k \Big/ \sum_{k=1}^{K} w_k}{\sqrt{\sum_{k=1}^{K} w_k \hat p_k (1 - \hat p_k)} \Big/ \sum_{k=1}^{K} w_k} = \frac{\sum_{k=1}^{K} w_k d_k}{\sqrt{\sum_{k=1}^{K} w_k \hat p_k (1 - \hat p_k)}}$$

All strata such that $r_{1k} = 0$ or $r_{2k} = 0$ are excluded, because $d_k$ is undefined. If every stratum is such, C is undefined. Note that a stratum such that $r_{1k} > 0$ and $r_{2k} > 0$ but that $c_{1k} = 0$ or $c_{2k} = 0$ is a valid stratum, although it contributes nothing to the denominator or numerator. However, if every stratum is such, C is again undefined. So, in order to compute a non-system missing value of C, at least one stratum must have all non-zero marginal totals.

Alternatively, Cochran’s statistic can be written as

$$C = \frac{\sum_{k=1}^{K} (f_{11k} - E_{11k})}{\sqrt{\sum_{k=1}^{K} w_k \hat p_k (1 - \hat p_k)}}$$

When the number of strata is fixed as the sample sizes within each stratum increase, Cochran’s statistic is asymptotically standard normal, and thus its square is asymptotically distributed as a chi-squared distribution with 1 df.


Mantel and Haenszel’s Statistic


Mantel and Haenszel’s statistic (Mantel and Haenszel, 1959) is simply Cochran’s statistic with small-sample corrections for continuity and variance “inflation.” These corrections are desirable when $r_{1k}$ and $r_{2k}$ are small, but the corrections can make a noticeable difference even for relatively large $r_{1k}$ and $r_{2k}$ (Snedecor and Cochran, 1980, p. 213). The statistic is defined as:

$$M = \frac{\left\{\left|\sum_{k=1}^{K} (f_{11k} - E_{11k})\right| - 0.5\right\} \operatorname{sgn}\left\{\sum_{k=1}^{K} (f_{11k} - E_{11k})\right\}}{\sqrt{\sum_{k=1}^{K} \dfrac{r_{1k} r_{2k}}{n_k - 1}\, \hat p_k (1 - \hat p_k)}}$$

where sgn is the signum function

$$\operatorname{sgn}(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x = 0 \\ -1 & \text{if } x < 0 \end{cases}$$

Any stratum in which $n_k = 1$ is excluded from the computation. If every stratum is such, then M is undefined. M is also undefined if every stratum is such that $r_{1k} = 0$, $r_{2k} = 0$, $c_{1k} = 0$, or $c_{2k} = 0$. In order to compute a non-system missing value of M, at least one stratum must have all non-zero marginal totals, just as for C.

When the number of strata is fixed as the sample sizes within each stratum increase, or when the sample sizes within each stratum are fixed as the number of strata increases, this statistic is asymptotically standard normal, and thus its square is asymptotically distributed as a chi-squared distribution with 1 d.f.
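
The two statistics differ only in the continuity correction and in the use of $n_k - 1$ in the variance, which the sketch below makes explicit. It assumes each stratum is given as a tuple `(f11, f12, f21, f22)` and skips strata with an empty row, which also covers strata with a single case. The function name is illustrative.

```python
import math

def cochran_and_mh(strata):
    """Return (Cochran's C, Mantel-Haenszel M) for a list of 2x2 strata."""
    num = var_c = var_m = 0.0
    for f11, f12, f21, f22 in strata:
        r1, r2 = f11 + f12, f21 + f22
        c1 = f11 + f21
        n = r1 + r2
        if r1 == 0 or r2 == 0:
            continue                    # d_k undefined; also excludes n_k = 1
        p = c1 / n
        num += f11 - r1 * c1 / n        # f_11k - E_11k, equal to w_k d_k
        var_c += r1 * r2 / n * p * (1 - p)
        var_m += r1 * r2 / (n - 1) * p * (1 - p)
    if var_c == 0:
        return None, None               # no stratum with non-zero margins
    sgn = (num > 0) - (num < 0)
    return num / math.sqrt(var_c), (abs(num) - 0.5) * sgn / math.sqrt(var_m)
```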


The Breslow-Day Statistic 
The Breslow-Day statistic for any estimator $\hat\theta$ is

$$B = \sum_{k=1}^{K} \frac{\left\{f_{11k} - E(f_{11k}|c_{1k}; \hat\theta)\right\}^2}{V(f_{11k}|c_{1k}; \hat\theta)}$$

E and V are based on the exact moments, but it is customary to replace them with the asymptotic expectation and variance. Let $\hat E$ and $\hat V$ mean the estimated asymptotic expectation and the estimated asymptotic variance, respectively. Given the Mantel-Haenszel common odds ratio estimator $\hat\theta_{MH}$, we use the following statistic as the Breslow-Day statistic:

$$B = \sum_{k=1}^{K} \frac{\left\{f_{11k} - \hat E(f_{11k}|c_{1k}; \hat\theta_{MH})\right\}^2}{\hat V(f_{11k}|c_{1k}; \hat\theta_{MH})}$$

where

$$\hat E(f_{11k}|c_{1k}; \hat\theta_{MH}) = \hat f_{11k}$$

satisfies the equations

$$\frac{\hat f_{11k}\,(n_k - r_{1k} - c_{1k} + \hat f_{11k})}{(r_{1k} - \hat f_{11k})(c_{1k} - \hat f_{11k})} = \hat\theta_{MH}$$

with constraints such that

$$\hat f_{11k} \ge 0, \quad r_{1k} - \hat f_{11k} \ge 0, \quad c_{1k} - \hat f_{11k} \ge 0, \quad n_k - r_{1k} - c_{1k} + \hat f_{11k} \ge 0;$$

and

$$\hat V(f_{11k}|c_{1k}; \hat\theta_{MH}) = \left(\frac{1}{\hat f_{11k}} + \frac{1}{\hat f_{12k}} + \frac{1}{\hat f_{21k}} + \frac{1}{\hat f_{22k}}\right)^{-1}$$

with constraints such that

$$\hat f_{11k} > 0, \quad \hat f_{12k} = r_{1k} - \hat f_{11k} > 0, \quad \hat f_{21k} = c_{1k} - \hat f_{11k} > 0, \quad \hat f_{22k} = n_k - r_{1k} - c_{1k} + \hat f_{11k} > 0.$$

All strata such that $r_{1k} = 0$ or $c_{1k} = 0$ are excluded. If every stratum is such, B is undefined. Strata such that $\hat f_{11k} = 0$ are also excluded. If every stratum is such, then B is undefined.

Breslow-Day’s statistic is asymptotically distributed as a chi-squared random variable with K-1 degrees of freedom under the null hypothesis of a constant odds ratio.


Tarone’s Statistic


Tarone (Tarone, 1985) proposes an adjustment to the Breslow-Day statistic when the common odds ratio estimator is consistent but inefficient, specifically when we have the Mantel-Haenszel common odds ratio estimator. The adjusted statistic, Tarone’s statistic, for $\hat\theta_{MH}$ is

$$T = \sum_{k=1}^{K} \frac{\left\{f_{11k} - \hat E(f_{11k}|c_{1k}; \hat\theta_{MH})\right\}^2}{\hat V(f_{11k}|c_{1k}; \hat\theta_{MH})} - \frac{\left[\sum_{k=1}^{K} \left\{f_{11k} - \hat E(f_{11k}|c_{1k}; \hat\theta_{MH})\right\}\right]^2}{\sum_{k=1}^{K} \hat V(f_{11k}|c_{1k}; \hat\theta_{MH})}$$

where $\hat E$ and $\hat V$ are as before.

The required data conditions are the same as for the Breslow-Day statistic computation. T is, of course, undefined when B is undefined.

T is also asymptotically distributed as a chi-squared random variable with K-1 degrees of freedom under the null hypothesis of a constant odds ratio.


Estimation of the Common Odds Ratio 


For K strata of 2 x 2 tables, write the true odds ratios as

$$\theta_k = \frac{p_{1k}(1 - p_{2k})}{(1 - p_{1k})\, p_{2k}}$$

for k = 1,..., K. And, assuming that the true common odds ratio exists, $\theta = \theta_1 = \ldots = \theta_K$, Mantel and Haenszel’s estimator (Mantel and Haenszel, 1959) of this common odds ratio is

$$\hat\theta_{MH} = \frac{\sum_{k=1}^{K} f_{11k} f_{22k} / n_k}{\sum_{k=1}^{K} f_{12k} f_{21k} / n_k}$$

If every stratum is such that $f_{12k} = 0$ or $f_{21k} = 0$, then $\hat\theta_{MH}$ is undefined. The (natural) log of the estimated common odds ratio is asymptotically normal. Note, however, that if $f_{11k} = 0$ or $f_{22k} = 0$ in every stratum, then $\hat\theta_{MH}$ is zero and $\log(\hat\theta_{MH})$ is undefined.
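

The Asymptotic Confidence Interval


Robins et al. (Robins, Breslow, and Greenland, 1986) give an estimated asymptotic variance for $\log(\hat\theta_{MH})$ that is appropriate in both asymptotic cases:

$$\begin{aligned} \hat\sigma^2\left[\log(\hat\theta_{MH})\right] ={} & \frac{\sum_{k=1}^{K} (f_{11k} + f_{22k})\, f_{11k} f_{22k} / n_k^2}{2\left(\sum_{k=1}^{K} f_{11k} f_{22k} / n_k\right)^2} \\ & + \frac{\sum_{k=1}^{K} \left[(f_{11k} + f_{22k})\, f_{12k} f_{21k} + (f_{12k} + f_{21k})\, f_{11k} f_{22k}\right] / n_k^2}{2\left(\sum_{k=1}^{K} f_{11k} f_{22k} / n_k\right)\left(\sum_{k=1}^{K} f_{12k} f_{21k} / n_k\right)} \\ & + \frac{\sum_{k=1}^{K} (f_{12k} + f_{21k})\, f_{12k} f_{21k} / n_k^2}{2\left(\sum_{k=1}^{K} f_{12k} f_{21k} / n_k\right)^2} \end{aligned}$$

An asymptotic $100(1 - \alpha)\%$ confidence interval for $\log(\theta)$ is

$$\log(\hat\theta_{MH}) \pm z(\alpha/2)\, \hat\sigma\left[\log(\hat\theta_{MH})\right]$$

where $z(\alpha/2)$ is the upper $\alpha/2$ critical value for the standard normal distribution. All these computations are valid only if $\hat\theta_{MH}$ is defined and greater than 0.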
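
The estimator, its Robins-Breslow-Greenland variance, and the log-scale interval can be sketched together. The sketch assumes each stratum is a tuple `(f11, f12, f21, f22)` and that the estimate is defined and positive; names such as `mh_common_odds_ratio` are illustrative.

```python
import math
from statistics import NormalDist

def mh_common_odds_ratio(strata, alpha=0.05):
    """Return (theta_MH, var of log theta_MH, CI for log theta)."""
    sR = sS = sPR = sMid = sQS = 0.0
    for f11, f12, f21, f22 in strata:
        n = f11 + f12 + f21 + f22
        if n == 0:
            continue                      # empty strata are omitted
        Rk, Sk = f11 * f22 / n, f12 * f21 / n
        Pk, Qk = (f11 + f22) / n, (f12 + f21) / n
        sR += Rk
        sS += Sk
        sPR += Pk * Rk                    # first variance term numerator
        sMid += Pk * Sk + Qk * Rk         # middle term numerator
        sQS += Qk * Sk                    # last term numerator
    theta = sR / sS
    var = sPR / (2 * sR ** 2) + sMid / (2 * sR * sS) + sQS / (2 * sS ** 2)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * math.sqrt(var)
    log_theta = math.log(theta)
    return theta, var, (log_theta - half, log_theta + half)
```

With a single stratum the variance reduces to the familiar sum of reciprocal cell counts, which gives a convenient check of the three-term expression.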


The Asymptotic P-value 


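We compute an asymptotic P-value under the null hypothesis that $\theta\,(= \theta_k, \forall k) = \theta_0\,(> 0)$ against a 2-sided alternative hypothesis ($\theta \ne \theta_0$), using the standard normal variate, as follows

$$\Pr\left(|Z| > \frac{\left|\log(\hat\theta_{MH}) - \log(\theta_0)\right|}{\hat\sigma\left[\log(\hat\theta_{MH})\right]}\right) = 2 \Pr\left(Z > \frac{\left|\log(\hat\theta_{MH}) - \log(\theta_0)\right|}{\hat\sigma\left[\log(\hat\theta_{MH})\right]}\right)$$

given that $\log(\hat\theta_{MH})$ is defined.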


S 


Alternatively, we can consider using $\hat\theta_{MH}$ and the estimated exact variance of $\hat\theta_{MH}$, which is 
still consistent in both limiting cases: 


G2 [log (vex) | er 


Then, the asymptotic P-value may be approximated by 
MH 8 


Px(|z G\log °MH tor 


The caveat for this formula is that $\hat\theta_{MH}$ may be quite skewed even in moderate sample sizes 
(Robins et al., 1986). 


Column Proportions Test 


This section describes the computation of the column proportions test. 


Notation


The following notation is used throughout this section unless otherwise stated:


Table 23-4
Notation

Notation         Description
R                Number of rows in the sub-table.
C                Number of columns in the sub-table.
$A_i$            ith category of the row variable.
$B_j$            jth category of the column variable.
$f_{ij}$         Total case weights in cell (i,j).
$c_j$            Marginal case weights total in jth column.
$c_j^*$          Rounded marginal case weights total in jth column.
z                z-statistic.
$\chi^2$         Chi-square statistic.
$p_{ij}$         Column proportion for cell (i,j).
$\hat p_{ij}$    Estimated column proportion for cell (i,j).
$\hat p_{ijk}$   Estimate of pooled column proportion of jth and kth column in ith row.
p                p-value of a test.
$p_B$            Bonferroni corrected p-value.
$\alpha$         The significance level supplied by the user.


Conditions and Assumptions


■ Pairwise tests are performed on each row of all eligible innermost sub-tables within each layer.

■ The number of rows and columns in the sub-table must each be greater than or equal to two.

■ Tests are constructed by using all visible categories excluding totals and sub-totals. Hiding of
categories and showing of user-missing categories are respected.

■ If weighting is on, cell statistics must include weighted cell counts or weighted simple column
percents; a weighted analysis will be performed. If weighting is off, cell statistics requested
must include cell counts or simple column percents; an unweighted analysis will be performed.

■ A proportion will be discarded if the proportion is equal to zero or one, or the sum of case
weights in a category is less than 2; that is, if $c_j < 2$. If fewer than two proportions are left after
discarding proportions, the test will not be performed.


Statistics 


The following statistics are available.


Table Layout

            $B_1$        $B_2$        ...    $B_C$
$A_1$       $p_{11}$     $p_{12}$     ...    $p_{1C}$
$A_2$       $p_{21}$     $p_{22}$     ...    $p_{2C}$
...
$A_R$       $p_{R1}$     $p_{R2}$     ...    $p_{RC}$


Hypothesis


Without loss of generality, we will only look at the ith row of the table. Let $C^*$ be the number of
categories in the ith row where the proportion is greater than zero and less than one, and where
the sum of case weights in the corresponding column is at least 2. In the ith row, $C^*(C^* - 1)/2$
comparisons will be made among $p_{i1}, p_{i2}, \ldots, p_{iC}$. The (j,k)th hypothesis will be

$$H_{0jk} : p_{ij} = p_{ik} \quad \text{vs.} \quad H_{1jk} : p_{ij} \ne p_{ik}$$

Aggregated Statistics


Column proportions tests are based on the aggregated proportions ($\hat{p}_{ij}$) and cell counts for each
column ($c_j$). Column proportions are computed using the un-rounded cell counts, $\hat{p}_{ij} = f_{ij} / c_j$, which
are equal to the proportions actually displayed.


Statistics for the (j,k)th Comparisons


Pooled proportion:

$$\hat{p}_{ijk} = \frac{c_j \hat{p}_{ij} + c_k \hat{p}_{ik}}{c_j + c_k}$$

z statistic with a categorical variable in the columns:

$$z = \frac{\hat{p}_{ij} - \hat{p}_{ik}}{\sqrt{\hat{p}_{ijk} (1 - \hat{p}_{ijk}) \left( \frac{1}{c_j} + \frac{1}{c_k} \right)}}$$

When a multiple response set defines the columns, there may exist cases that belong to both the jth and kth
columns. Let $c_{jk}$ be the rounded sum of weights for such cases.

z statistic with a multiple response set in the columns:

$$z = \frac{\hat{p}_{ij} - \hat{p}_{ik}}{\sqrt{\hat{p}_{ijk} (1 - \hat{p}_{ijk}) \left( \frac{1}{c_j} + \frac{1}{c_k} - \frac{2 c_{jk}}{c_j c_k} \right)}}$$

p-value: $p = 2[1 - \Phi(|z|)]$,
where $\Phi(z)$ is the CDF of the standard normal distribution.


Alternatively, the statistic can be constructed as a chi-square statistic,

$$\chi^2 = z^2$$

The p-value is then given by $p = 1 - F(\chi^2; 1)$, where $F(x; df)$ is the CDF of a chi-square
distribution with df degrees of freedom.


A comparison is significant if $p < \alpha$ (or $p_B < \alpha$, if Bonferroni adjusted).


Bonferroni Adjustment


If Bonferroni adjustment for multiple comparisons is requested, the p-value will be adjusted by

$$p_B = \min\left( 1, \; p \, \frac{C^*(C^* - 1)}{2} \right)$$


Relationship to Pearson’s Chi-Square Tests 


With a categorical variable in the columns, the statistic used in the column proportions test is
equivalent to Pearson’s chi-square test on the 2x2 table formed by taking the jth and kth columns and collapsing
all rows except the ith row. Therefore, performing a column proportions test on a 2x2 table gives
the same result as Pearson’s chi-square test.
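
The pairwise z statistic, its Bonferroni adjustment, and the stated equivalence with Pearson's chi-square can be checked numerically. This is a small sketch for the categorical-column case only; `column_proportion_test` and `pearson_2x2` are hypothetical helper names, and the number of comparisons $C^*(C^*-1)/2$ is passed in by the caller.

```python
import math
from statistics import NormalDist


def column_proportion_test(f_ij, c_j, f_ik, c_k, n_comparisons=1):
    """Compare column proportions f_ij/c_j and f_ik/c_k within row i."""
    p_ij = f_ij / c_j
    p_ik = f_ik / c_k
    pooled = (c_j * p_ij + c_k * p_ik) / (c_j + c_k)
    se = math.sqrt(pooled * (1 - pooled) * (1 / c_j + 1 / c_k))
    z = (p_ij - p_ik) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    p_bonferroni = min(1.0, p * n_comparisons)
    return z, p, p_bonferroni


def pearson_2x2(a, b, c, d):
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Row i has weights 30 and 45 in columns of size 100 and 120; the remaining
# rows collapse to 70 and 75, so z**2 should match the 2x2 Pearson statistic.
z, p, p_b = column_proportion_test(30, 100, 45, 120, n_comparisons=3)
chi2 = pearson_2x2(30, 45, 70, 75)
```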


Use of Case Weights


The case weights (or frequency weights) are supposed to be integers representing the number of
replications of each case. In column proportions tests, we only check whether the column marginals
$c_j$ are integers. If not, they will be rounded to the nearest integer.


CSCOXREG Algorithms


Survival analysis studies the failure time distribution. This algorithm considers the Cox 
proportional hazards regression model under the complex sampling setting. The failure time 
is assumed to be continuous here. 


Notation 


The following notation is used throughout this chapter unless otherwise stated:


$t_i$                       For data with one time interval, the observed end time for record i.

$t_{1i}, t_{2i}$            For data with two time intervals, the observed enter and end time for record
                            i, $t_{1i} < t_{2i}$.

$\delta_i$                  The zero-one status indicator with $\delta_i = 1$ indicating end time $t_i$ or $t_{2i}$ being
                            failure time, and $\delta_i = 0$ indicating $t_i$ or $t_{2i}$ being right censoring time.

$t_1^* < \cdots < t_K^*$    The ordered observed failure times where K is the number of distinct failure
                            times in the data set.

$x_i$                       Predictor vector for record i, $x_i = (x_{i1}, \ldots, x_{ip})'$. No intercept term.

$x_0$                       Vector of reference values for transforming predictors. For more
                            information, see the topic “Predictor Transformations”.

$X$                         Design matrix $X = (x_1, \ldots, x_n)'$.

$D(t)$                      The set of records failed at time t. $D(t) = \{i : t_i = t, \delta_i = 1\}$ for data with
                            one time variable, and $D(t) = \{i : t_{2i} = t, \delta_i = 1\}$ for data with two time
                            variables.

$R(t)$                      The set of records at risk at time t. $R(t) = \{i : t_i \ge t\}$ for data with one time
                            variable, and $R(t) = \{i : t_{1i} < t \le t_{2i}\}$ for data with two time variables.

$d(t)$                      The number of records failed at time t; that is, the number of records in $D(t)$.

$Y_i(t)$                    The at-risk indicator for record i such that $Y_i(t) = 1$ if $i \in R(t)$, and
                            $Y_i(t) = 0$ otherwise.

$S(t|x)$                    Survival function at time t for a given predictor vector x,
                            $S(t|x) = \Pr(T > t \mid x)$, where T is a random variable representing survival
                            time.

$h(t|x)$                    Hazard function at time t for a given predictor vector x,
                            $h(t|x) = \lim_{\Delta t \to 0^+} \Pr(t \le T < t + \Delta t \mid T \ge t, x) / \Delta t$.

$H(t|x)$                    Cumulative hazard function at time t for a given predictor vector x,
                            $H(t|x) = \int_0^t h(u|x) \, du$.

$h_0(t)$                    Baseline hazard function at time t, $h_0(t) = h(t \mid x = 0)$.

$H_0(t)$                    Cumulative baseline hazard function at time t, $H_0(t) = \int_0^t h_0(u) \, du$.

$S_0(t)$                    Baseline survival function at time t, $S_0(t) = \exp(-H_0(t))$.

$N$                         The number of cases in the whole population.

$n$                         The number of cases/records in the sample.

$n_s$                       The number of subjects/individuals in the sample.

$w_i$                       Sampling weight for record i, $w_i = 1/\pi_i$.

$B$                         The parameter of interest, the population or census parameter.

$\hat{B}$                   The estimate of census parameter B from the sample.


Input 


Sampling plan. This plan is needed for sampling method, sampling weight, strata and cluster
information.


Observed sample data. Two kinds of data structures are allowed.


■ Data with one time interval: $\{t_i, \delta_i, x_i, w_i\}_{i=1}^{n}$.


■ Data with two time intervals: $\{t_{1i}, t_{2i}, \delta_i, x_i, w_i\}_{i=1}^{n}$, or $\{id_i, t_{1i}, t_{2i}, \delta_i, x_i, w_i\}_{i=1}^{n}$, where
$(t_{1i}, t_{2i}]$ is the time interval during which the record is at risk, and $id_i$ is the subject id for
record i. Multiple records for the same subject have the same id and same sampling weight.
Multiple records of the same subject should have disjoint time intervals. If $id_i$ is not specified,
each record is assumed to come from a different subject.


Note: Data with one time interval is simply a special case of data with two time intervals where
$t_{1i} = 0$ and $t_{2i} = t_i$. The rest of this document is written from the perspective of data with two
time intervals.


Predictor Transformations 


To decrease the chance of over- or underflow when calculating exp(.), first a transformation
$z = x - x_0$ is performed on each predictor for a properly chosen $x_0$ (reference value). Then all
the calculations described in other sections are performed on the transformed data. Except for
baseline hazards and baseline survival functions, all other quantities based on transformed data
are the same as those based on original data.


For a continuous predictor x in the original covariate list, the reference value $x_0$ is chosen to be

$$x_0 = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

Note that $x_0$ is not the mean of x when there are multiple cases per subject or x is a time dependent
predictor.
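
To see why centering helps, the following sketch computes the weighted reference value and the centered predictor for a toy data set. The helper names `reference_value` and `center` and the numbers are made up for illustration.

```python
import math


def reference_value(x, w):
    """Weighted reference value x0 = sum(w_i * x_i) / sum(w_i)."""
    return sum(wi * xi for wi, xi in zip(w, x)) / sum(w)


def center(x, x0):
    """Transformed predictor z_i = x_i - x0."""
    return [xi - x0 for xi in x]


x = [700.0, 710.0, 720.0]
w = [1.0, 2.0, 1.0]
beta = 1.5

x0 = reference_value(x, w)      # 710.0 for these values
z = center(x, x0)               # [-10.0, 0.0, 10.0]

# math.exp(beta * 700.0) would raise OverflowError in double precision,
# whereas the centered terms stay well within range.
centered_terms = [math.exp(beta * zi) for zi in z]
```

Because every term in a risk-set sum is rescaled by the same factor $\exp(-x_0 \beta)$, ratios such as $\exp(x_i \beta) / \sum_l w_l \exp(x_l \beta)$ are unchanged by the shift.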


For a categorical predictor, the last category is the reference value. 


The reference values for model effects derived from original predictors, such as interactions, are
derived from the reference values of the original predictors in the same way the effects are derived.


Proportional Hazards


Model 


Two phases of sampling are assumed. The first phase generates a finite population by a model 
or super population. The second phase selects a sample according to a sampling plan from the 
finite population generated in the first phase. 


For a given predictor vector x the hazard function at time t is

$$h(t|x) = h_0(t) \exp\{x'\beta\}$$

or

$$\ln\left( \frac{h(t|x)}{h_0(t)} \right) = x'\beta$$

where $h_0(t)$ is the baseline hazard function. The regression parameter vector $\beta$ doesn’t include an
intercept term because the intercept can be absorbed by the baseline hazard.


Survival and cumulative hazard functions 


From this model the cumulative hazard function is

$$H(t|x) = \int_0^t h(u|x) \, du = \exp(x'\beta) \int_0^t h_0(u) \, du = \exp(x'\beta) H_0(t)$$

where $H_0(t) = \int_0^t h_0(u) \, du$ is the baseline cumulative hazard function. The survival function is

$$S(t|x) = \exp\{-\exp(x'\beta) H_0(t)\} = \{S_0(t)\}^{\exp(x'\beta)}$$

where $S_0(t) = \exp(-H_0(t))$ is the baseline survival function.


Pseudo Partial Likelihood and Derivatives 


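For a sample $s = \{t_{1i}, t_{2i}, \delta_i, x_i, w_i\}_{i=1}^{n}$ drawn from the finite population according to a sample
plan, we take the pseudo-likelihood approach. In this approach, the pseudo-likelihood is a sample
estimate of the population log-likelihood, and parameter estimates are derived by maximizing
the pseudo-likelihood. Let $l_S(\beta)$, $U_S(\beta)$ and $J_S(\beta)$ denote the pseudo-likelihood, its first
derivative, and its negative second derivative.


For the Breslow approximation:

$$l_S(\beta) = \sum_{i=1}^{n} w_i \delta_i \left[ x_i'\beta - \ln \sum_{l \in R(t_{2i})} w_l \exp(x_l'\beta) \right]$$

For the Efron approximation:

$$l_S(\beta) = \sum_{i=1}^{n} w_i \delta_i \left[ x_i'\beta - \frac{1}{d(t_{2i})} \sum_{r=0}^{d(t_{2i})-1} \ln\left( \sum_{l \in R(t_{2i})} w_l \exp(x_l'\beta) - \frac{r}{d(t_{2i})} \sum_{l \in D(t_{2i})} w_l \exp(x_l'\beta) \right) \right]$$

Let

$$E^{(0)}(\beta, t) = \sum_{l \in R(t)} w_l \exp(x_l'\beta) = \sum_{l=1}^{n} w_l Y_l(t) \exp(x_l'\beta)$$

$$E^{(1)}(\beta, t) = \frac{\partial E^{(0)}(\beta, t)}{\partial \beta} = \sum_{l \in R(t)} w_l x_l \exp(x_l'\beta) = \sum_{l=1}^{n} w_l Y_l(t) x_l \exp(x_l'\beta)$$

$$E^{(2)}(\beta, t) = \frac{\partial^2 E^{(0)}(\beta, t)}{\partial \beta \, \partial \beta'} = \sum_{l \in R(t)} w_l x_l x_l' \exp(x_l'\beta) = \sum_{l=1}^{n} w_l Y_l(t) x_l x_l' \exp(x_l'\beta)$$

$$EE^{(0)}(\beta, t, r) = \sum_{l \in R(t)} w_l \exp(x_l'\beta) - \frac{r}{d(t)} \sum_{l \in D(t)} w_l \exp(x_l'\beta)$$

$$EE^{(1)}(\beta, t, r) = \sum_{l \in R(t)} w_l x_l \exp(x_l'\beta) - \frac{r}{d(t)} \sum_{l \in D(t)} w_l x_l \exp(x_l'\beta)$$

$$EE^{(2)}(\beta, t, r) = \sum_{l \in R(t)} w_l x_l x_l' \exp(x_l'\beta) - \frac{r}{d(t)} \sum_{l \in D(t)} w_l x_l x_l' \exp(x_l'\beta)$$

$$\bar{x}(\beta, t, r) = \frac{EE^{(1)}(\beta, t, r)}{EE^{(0)}(\beta, t, r)}$$

$$\bar{x}(\beta, t) = \begin{cases} \dfrac{E^{(1)}(\beta, t)}{E^{(0)}(\beta, t)} & \text{Breslow} \\[2ex] \dfrac{1}{d(t)} \sum_{r=0}^{d(t)-1} \dfrac{EE^{(1)}(\beta, t, r)}{EE^{(0)}(\beta, t, r)} & \text{Efron} \end{cases}$$

$$u_i(\beta, t) = x_i - \bar{x}(\beta, t)$$

$$J_i(\beta, t) = \begin{cases} \dfrac{E^{(2)}(\beta, t)}{E^{(0)}(\beta, t)} - \bar{x}(\beta, t) \, \bar{x}(\beta, t)' & \text{Breslow} \\[2ex] \dfrac{1}{d(t)} \sum_{r=0}^{d(t)-1} \left[ \dfrac{EE^{(2)}(\beta, t, r)}{EE^{(0)}(\beta, t, r)} - \bar{x}(\beta, t, r) \, \bar{x}(\beta, t, r)' \right] & \text{Efron} \end{cases}$$

Then

$$l_S(\beta) = \begin{cases} \sum_{i=1}^{n} w_i \delta_i \left( x_i'\beta - \ln E^{(0)}(\beta, t_{2i}) \right) & \text{Breslow} \\[1ex] \sum_{i=1}^{n} w_i \delta_i \left( x_i'\beta - \dfrac{1}{d(t_{2i})} \sum_{r=0}^{d(t_{2i})-1} \ln EE^{(0)}(\beta, t_{2i}, r) \right) & \text{Efron} \end{cases}$$

$$U_S(\beta) = \frac{\partial l_S(\beta)}{\partial \beta} = \sum_{i=1}^{n} w_i \delta_i u_i(\beta, t_{2i})$$

$$J_S(\beta) = -\frac{\partial^2 l_S(\beta)}{\partial \beta \, \partial \beta'} = \sum_{i=1}^{n} w_i \delta_i J_i(\beta, t_{2i})$$

These equations are used to calculate the needed quantities throughout the rest of the document.
When predictors are time-dependent, these equations need to be modified accordingly. For more
information, see the topic “Time-Dependent Predictors”.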
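
As an illustration of the Breslow quantities, the sketch below evaluates the pseudo-likelihood, its first derivative, and the corresponding information term for a single scalar predictor. The function name `breslow_terms` and the record layout `(t1, t2, delta, x, w)` are assumptions for this example; the vector case and the Efron approximation are not covered.

```python
import math


def breslow_terms(records, beta):
    """Return (l_S, U_S, J_S) under the Breslow approximation.

    records: list of (t1, t2, delta, x, w) with a scalar predictor x
    (hypothetical layout); beta: scalar regression parameter.
    """
    loglik = score = info = 0.0
    for _, t2_i, delta_i, x_i, w_i in records:
        if not delta_i:
            continue                      # only failures contribute
        t = t2_i
        e0 = e1 = e2 = 0.0
        for t1_l, t2_l, _, x_l, w_l in records:
            if t1_l < t <= t2_l:          # record l is in the risk set R(t)
                term = w_l * math.exp(x_l * beta)
                e0 += term
                e1 += term * x_l
                e2 += term * x_l * x_l
        x_bar = e1 / e0
        loglik += w_i * (x_i * beta - math.log(e0))
        score += w_i * (x_i - x_bar)
        info += w_i * (e2 / e0 - x_bar * x_bar)
    return loglik, score, info


records = [
    (0.0, 1.0, 1, 0.0, 1.0),
    (0.0, 2.0, 1, 1.0, 2.0),
    (0.0, 3.0, 0, 0.5, 1.0),
    (0.0, 4.0, 1, 2.0, 1.5),
]
l0, u0, j0 = breslow_terms(records, 0.0)
```

A finite-difference check of `loglik` against `score`, and of `score` against `info`, is a convenient way to confirm the signs used here.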


Parameter Estimation 


To obtain the maximum pseudo-likelihood estimate of B, the Newton-Raphson iterative estimation
method is used to solve the estimating equation. Redundant parameters are fixed at zero for all
iterations. Let $B^{(v)}$ be the parameter estimate at iteration step v; the parameter estimate $B^{(v+1)}$ at
iteration step v + 1 is updated as

$$B^{(v+1)} = B^{(v)} + \xi \left( J_S(B^{(v)}) \right)^{-} U_S(B^{(v)})$$

where $(J_S(\cdot))^{-}$ is a generalized inverse of $J_S(\cdot)$. The stepping scalar $\xi > 0$ is used to make
$l_S(B^{(v+1)}) \ge l_S(B^{(v)})$. Use the step-halving method if $l_S(B^{(v+1)}) < l_S(B^{(v)})$. Let s be the
maximum number of steps in step-halving; the set of values of $\xi$ is then $\{1/2^r : r = 0, \ldots, s-1\}$.


Starting with initial value $B^{(0)}$, update $B^{(v+1)}$ until one of the stopping criteria is satisfied. The
final estimate is denoted as $\hat{B}$.


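Initial values


By default, $B^{(0)} = 0$.


Stopping criteria


Given two small constants $\epsilon_l > 0$ and $\epsilon_p > 0$, the iteration stops if one of the following criteria is
satisfied:


1. Pseudo-likelihood criterion

$$\frac{\left| l_S(B^{(v+1)}) - l_S(B^{(v)}) \right|}{\left| l_S(B^{(v)}) \right| + 10^{-6}} < \epsilon_l \quad \text{if relative change}$$

$$\left| l_S(B^{(v+1)}) - l_S(B^{(v)}) \right| < \epsilon_l \quad \text{if absolute change}$$

2. Parameter criterion

$$\max_j \left( \frac{\left| B_j^{(v+1)} - B_j^{(v)} \right|}{\left| B_j^{(v)} \right| + 10^{-6}} \right) < \epsilon_p \quad \text{if relative change}$$

$$\max_j \left| B_j^{(v+1)} - B_j^{(v)} \right| < \epsilon_p \quad \text{if absolute change}$$

3. The maximum number of iterations is reached, or the maximum number of steps in step-halving
is reached.


Either relative or absolute change is considered in criteria 1 and 2.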
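

The update rule, step-halving, and the relative pseudo-likelihood criterion can be sketched for a one-parameter problem. This is an illustrative outline only: `newton_step_halving` is a hypothetical name, the objective passed in is a toy concave function rather than a Cox pseudo-likelihood, and the scalar division stands in for the generalized inverse.

```python
import math


def newton_step_halving(loglik, score, info, b0=0.0,
                        eps_l=1e-8, max_iter=50, max_halvings=5):
    """Maximize loglik(b) for scalar b with Newton steps and step-halving."""
    b = b0
    l_old = loglik(b)
    for _ in range(max_iter):
        step = score(b) / info(b)          # scalar analogue of J^- U
        accepted = False
        for r in range(max_halvings):      # xi takes values 1, 1/2, 1/4, ...
            xi = 1.0 / 2 ** r
            b_new = b + xi * step
            l_new = loglik(b_new)
            if l_new >= l_old:
                accepted = True
                break
        if not accepted:
            break                          # maximum number of halvings reached
        rel_change = abs(l_new - l_old) / (abs(l_old) + 1e-6)
        b, l_old = b_new, l_new
        if rel_change < eps_l:
            break                          # pseudo-likelihood criterion met
    return b


# Toy concave objective with maximum at b = ln(1.5).
b_hat = newton_step_halving(
    loglik=lambda b: 3.0 * b - 2.0 * math.exp(b),
    score=lambda b: 3.0 - 2.0 * math.exp(b),
    info=lambda b: 2.0 * math.exp(b),
)
```

Stopping when no halved step improves the objective corresponds to criterion 3; a parameter-change criterion could be added alongside the likelihood check in the same place.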


Infinite valued parameters 


There may be situations in which the maximum pseudo-likelihood estimates of some parameters
are infinite. For example, if there is no failure at one level of a binary predictor, the estimated
parameter would be infinity for this predictor. In this situation, the estimation procedure is
performed as usual. At the end of the estimation, we check for possible infinite parameters
and issue warnings if there are any. Parameter $B_j$ is possibly infinite if both of the following are
satisfied:


1. B; 


(2 jmax ~ Lj min) = 10 


2. The Hessian is singular, or se (2;) / | B; 


ao 


When there are infinite valued parameters, the Wald statistic for hypothesis testing involving 
infinite valued parameters becomes worthless. 


Properties of Estimates 


Variance matrix 


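Let

$$U_i(\beta) = \delta_i u_i(\beta, t_{2i}) - \sum_{m : \, t_{1i} < t_{2m} \le t_{2i}} \frac{\delta_m w_m u_i(\beta, t_{2m}) \exp(x_i'\beta)}{E^{(0)}(\beta, t_{2m})} \qquad \text{Breslow}$$

$$U_i(\beta) = \delta_i u_i(\beta, t_{2i}) - \sum_{m : \, t_{1i} < t_{2m} \le t_{2i}} \frac{\delta_m w_m}{d(t_{2m})} \sum_{r=0}^{d(t_{2m})-1} \frac{u_i(\beta, t_{2m}, r) \exp(x_i'\beta)}{EE^{(0)}(\beta, t_{2m}, r)} \left( 1 - \frac{r \, \delta_i I(t_{2i} = t_{2m})}{d(t_{2m})} \right) \qquad \text{Efron}$$

where $u_i(\beta, t, r) = x_i - \bar{x}(\beta, t, r)$.


We will use the following robust variance estimation (Binder 1992, Lin 2000),

$$\hat{V}(\hat{B}) = \left( J_S(\hat{B}) \right)^{-1} \tilde{V}(\hat{B}) \left( J_S(\hat{B}) \right)^{-1}$$

where $\tilde{V}(\hat{B})$ is the estimate of the design based variance of $\tilde{U}_S(\hat{B})$ with

$$\tilde{U}_S(\beta) = \sum_{j=1}^{n_s} w_j U_j^{(s)}(\beta)$$

$$U_j^{(s)}(\beta) = \sum_{i \in \{id_i = j\}} U_i(\beta)$$

Notice that the sum in $\tilde{U}_S(\beta)$ is over all $n_s$ subjects, and the sum in $U_j^{(s)}(\beta)$ is over all records
for subject j. The $\tilde{U}_S(\cdot)$ is an estimate for the population total of $U_j^{(s)}(\cdot)$ vectors. For more
information, see the topic “Complex Samples: Covariance Matrix of Total”.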


Confidence interval 


The confidence interval for a single regression parameter $B_j$ is approximately

$$\hat{B}_j \pm se(\hat{B}_j) \, t_{df, 1-\alpha/2}$$

where $t_{df, 1-\alpha/2}$ is the $100(1 - \alpha/2)$ percentile of a t distribution with df degrees of freedom.


The degrees of freedom df can be user specified; its default value is the difference between the 
number of primary sampling units and the number of strata in the first stage of sampling. 


Design effect 


For each parameter $B_j$, its design effect is the ratio of its variance under the design to its variance
under the SRS design,

$$Deff(\hat{B}_j) = \frac{\hat{V}(\hat{B}_j)}{\hat{V}_{SRS}(\hat{B}_j)}$$

For SRS design, the variance matrix is

$$\hat{V}_{SRS}(\hat{B}) = \left( J_S(\hat{B}) \right)^{-1} \tilde{V}_{SRS}(\hat{B}) \left( J_S(\hat{B}) \right)^{-1}$$

where

$$\hat{N} = \sum_{j=1}^{n_s} w_j$$

$$fpc = \begin{cases} \dfrac{\hat{N} - n_s}{\hat{N}} & \text{with finite population correction} \\[1ex] 1 & \text{without finite population correction} \end{cases}$$


t Tests


Testing hypothesis $H_0 : B_j = 0$ for each non-redundant model parameter $B_j$ is performed using
the t test statistic:

$$t(\hat{B}_j) = \frac{\hat{B}_j}{se(\hat{B}_j)}$$

The p-value for the two-sided test is given by the probability $P\left( |T| > |t(\hat{B}_j)| \right)$, where T is a
random variable from the t distribution with df degrees of freedom.


Exponentiated parameter estimates 


$\exp(\hat{B}_j)$ can be interpreted as a hazard ratio for a main effects model. Its $1 - \alpha$ confidence interval is

$$\left( \exp(L(\hat{B}_j)), \; \exp(U(\hat{B}_j)) \right)$$

where $L(\hat{B}_j)$ and $U(\hat{B}_j)$ are the lower and upper confidence limits for census parameter $B_j$.


Survival and Cumulative Hazard Functions 


In this section, $t_1^* < \cdots < t_K^*$ are the ordered observed failure times, and $t_0^* = 0$, $t_{K+1}^* = \infty$
are used for convenience. The estimates are valid for $t \in [0, \max_i(t_{2i})]$.


Estimation of Baseline Survival and Cumulative Hazard Functions 


Only one of these needs to be estimated because $H_0(t) = -\ln S_0(t)$ and $S_0(t) = \exp(-H_0(t))$.
The baseline functions are estimated by right continuous step functions with jumps only at
observed failure times; that is, $\hat{S}_0(t) = \hat{S}_0(t_j^*)$ and $\hat{H}_0(t) = \hat{H}_0(t_j^*)$ for $t \in [t_j^*, t_{j+1}^*)$.


Product-limit Estimate


The non-increasing right continuous baseline survival function $S_0(t)$ is estimated here. Let
the ratio jump be $\alpha_j = S_0(t_j^*) / S_0(t_{j-1}^*)$ for j = 1 to K, and $\alpha_0 = 1$, so $S_0(t) = \prod_{\{j : t_j^* \le t\}} \alpha_j$.
Assuming that the regression parameters are given, $\{\alpha_j\}_{j=1}^{K}$ will be the parameters to be estimated
by maximum likelihood estimation.


Pseudo likelihood and its derivatives


Let f(t|x) be the probability density function of failure time at t for a given predictor. The
pseudo likelihood is

$$l_S(\alpha, \beta) = \sum_{j=1}^{K} \left[ \sum_{i \in D(t_j^*)} w_i \ln\left( 1 - \alpha_j^{\exp(x_i'\beta)} \right) + \sum_{i \in R(t_j^*) - D(t_j^*)} w_i \exp(x_i'\beta) \ln \alpha_j \right]$$

We will estimate $\alpha_j$ by maximizing $l_S(\alpha, \hat{B})$, which is equivalent to solving $\partial l_S(\alpha, \hat{B}) / \partial \alpha_j = 0$ and
hence the following equation.

$$\sum_{i \in D(t_j^*)} \frac{w_i \exp(x_i'\hat{B})}{1 - \alpha_j^{\exp(x_i'\hat{B})}} = \sum_{i \in R(t_j^*)} w_i \exp(x_i'\hat{B})$$

Failure times of single failure 
If there is only a single failure $i_j$ at failure time $t_j^*$, there exists a closed form solution,

$$\hat{\alpha}_j = \left[ 1 - \frac{w_{i_j} \exp(x_{i_j}'\hat{B})}{\sum_{i \in R(t_j^*)} w_i \exp(x_i'\hat{B})} \right]^{\exp(-x_{i_j}'\hat{B})} = \left[ 1 - \frac{w_{i_j} \exp(x_{i_j}'\hat{B})}{E^{(0)}(\hat{B}, t_j^*)} \right]^{\exp(-x_{i_j}'\hat{B})}$$


Failure times with tied failures


If there are multiple failures at failure time $t_j^*$, Newton’s iterative method is used to solve the
equation with constraint $\alpha_j \in (0, 1)$. A good initial value is

$$\alpha_{j0} = \exp\left( - \frac{\sum_{l \in D(t_j^*)} w_l}{\sum_{l \in R(t_j^*)} w_l \exp(x_l'\hat{B})} \right) = \exp\left( - \frac{\sum_{l \in D(t_j^*)} w_l}{E^{(0)}(\hat{B}, t_j^*)} \right)$$


Kaplan-Meier estimator: a special situation of no predictors


When there are no predictors; that is, x = 0 always, the product-limit estimator becomes the
Kaplan-Meier estimator,

$$\hat{S}_0(t) = \prod_{\{j : t_j^* \le t\}} \left( 1 - \frac{\sum_{i \in D(t_j^*)} w_i}{\sum_{i \in R(t_j^*)} w_i} \right)$$
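

The estimating equation for a single $\alpha_j$ can be solved numerically as follows. The text specifies Newton's method for tied failures; this sketch deliberately substitutes bisection, which keeps $\alpha_j$ inside (0, 1) without extra safeguards. The name `solve_alpha` and the `(w, eta)` pair layout, with `eta` standing for the linear predictor $x_i'\hat{B}$, are assumptions for the example.

```python
import math


def solve_alpha(failed, at_risk, tol=1e-12):
    """Solve the product-limit equation for one failure time.

    failed / at_risk: lists of (w, eta) pairs for records in D(t) and R(t),
    where eta is the linear predictor; at_risk must include the failed records.
    """
    rhs = sum(w * math.exp(eta) for w, eta in at_risk)
    if len(failed) == 1:
        # Closed form for a single failure.
        w, eta = failed[0]
        return (1.0 - w * math.exp(eta) / rhs) ** math.exp(-eta)

    def g(alpha):
        # Left side minus right side; increasing in alpha on (0, 1).
        lhs = sum(w * math.exp(eta) / (1.0 - alpha ** math.exp(eta))
                  for w, eta in failed)
        return lhs - rhs

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# With no predictors (eta = 0) and tied failures, the solution is the
# Kaplan-Meier factor 1 - sum(w in D) / sum(w in R) = 1 - 3/10.
alpha_km = solve_alpha([(1.0, 0.0), (2.0, 0.0)],
                       [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)])
```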


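Breslow, or Nelson-Aalen, or Empirical, Estimate


Here $H_0(t)$ is estimated by a non-decreasing step function with steps at observed failure times:

$$\hat{H}_0(t) = \sum_{i=1}^{n} \int_0^t \frac{w_i \, dN_i(u)}{E^{(0)}(\hat{B}, u)} = \sum_{\{k : t_k^* \le t\}} \frac{\sum_{i \in D(t_k^*)} w_i}{E^{(0)}(\hat{B}, t_k^*)}$$

where $N_i(t)$ is the count of failures up to time t for record i.


Efron Estimate


When there are ties in failure times, the following estimation can also be used. This will reduce
to Breslow when there are no ties.

$$\hat{H}_0(t) = \sum_{\{k : t_k^* \le t\}} \frac{\sum_{i \in D(t_k^*)} w_i}{d(t_k^*)} \sum_{r=0}^{d(t_k^*)-1} \frac{1}{EE^{(0)}(\hat{B}, t_k^*, r)}$$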
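

A direct, unoptimized rendering of the Breslow estimate for a scalar predictor might look like the sketch below. `breslow_baseline_hazard` and the `(t1, t2, delta, x, w)` record layout are hypothetical; the Efron variant is not shown.

```python
import math


def breslow_baseline_hazard(records, beta):
    """Return [(t_k, H0_hat(t_k))] at the ordered distinct failure times.

    records: list of (t1, t2, delta, x, w) with scalar x (hypothetical layout).
    """
    failure_times = sorted({t2 for _, t2, delta, _, _ in records if delta})
    cumulative = 0.0
    steps = []
    for t in failure_times:
        # E0: weighted risk-set sum over records with t1 < t <= t2.
        e0 = sum(w * math.exp(x * beta)
                 for t1, t2, _, x, w in records if t1 < t <= t2)
        # Total weight of the records failing at t.
        failed_weight = sum(w for _, t2, delta, _, w in records
                            if delta and t2 == t)
        cumulative += failed_weight / e0
        steps.append((t, cumulative))
    return steps


records = [(0, 1, 1, 0.0, 1.0), (0, 2, 1, 0.0, 1.0), (0, 2, 0, 0.0, 1.0)]
steps = breslow_baseline_hazard(records, 0.0)
# Baseline survival follows from S0(t) = exp(-H0(t)).
survival = [(t, math.exp(-h)) for t, h in steps]
```

With unit weights and `beta = 0` the result is the ordinary Nelson-Aalen estimate, which is easy to verify by hand.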


Prediction of Survival and Cumulative Hazard Functions 


For a given x, the cumulative hazard function and survival functions are predicted by

$$\hat{H}(t|x) = \hat{H}_0(t) \exp(x'\hat{B})$$

$$\hat{S}(t|x) = \exp\left( -\hat{H}(t|x) \right) = \left( \hat{S}_0(t) \right)^{\exp(x'\hat{B})}$$

where $\hat{H}_0(t)$ and $\hat{S}_0(t)$ are the estimated baseline cumulative hazard function and baseline
survival function.


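For variance calculation, the same formula will be used regardless of different ways to estimate
baseline functions. The variance for cumulative hazard is

$$\hat{V}\left( \hat{H}(t|x) \right) \approx \widehat{Var}\left( \sum_{j=1}^{n_s} w_j q_j^{(s)}(t|x) \right)$$

where

$$q_j^{(s)}(t|x) = \sum_{i \in \{id_i = j\}} q_i(t|x)$$

$$q_i(t|x) = u_i(t|x) - \left( A(t|x) \right)' \left( J_S(\hat{B}) \right)^{-1} U_i(\hat{B})$$

$$u_i(t|x) = \frac{\delta_i I(t_{2i} \le t)}{E^{(0)}(\hat{B}, t_{2i}|x)} - \sum_{l=1}^{n} \frac{w_l \delta_l I(t_{2l} \le t) \, Y_i(t_{2l}) \exp\left( (x_i - x)'\hat{B} \right)}{\left( E^{(0)}(\hat{B}, t_{2l}|x) \right)^2}$$

$$A(t|x) = -\sum_{l=1}^{n} \frac{w_l \delta_l I(t_{2l} \le t) \, E^{(1)}(\hat{B}, t_{2l}|x)}{\left( E^{(0)}(\hat{B}, t_{2l}|x) \right)^2}$$

$$E^{(1)}(\beta, t|x) = \sum_{l=1}^{n} w_l Y_l(t) (x_l - x) \exp\left( (x_l - x)'\beta \right)$$

and $J_S(\beta)$ and $U_i(\beta)$ are defined in “Pseudo Partial Likelihood and Derivatives” and “Properties
of Estimates”, respectively. See Lin (2000) for more details. $\widehat{Var}\left( \sum_{j=1}^{n_s} w_j q_j^{(s)}(t) \right)$ is the
design-based variance of $\sum_{j=1}^{n_s} w_j q_j^{(s)}(t)$, which is the estimated population total of $q_j^{(s)}(t)$.
For more information, see the topic “Complex Samples: Covariance Matrix of Total”.


The variance estimate for the survival function is

$$\hat{V}\left( \hat{S}(t|x) \right) = \left( \hat{S}(t|x) \right)^2 \hat{V}\left( \hat{H}(t|x) \right)$$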


Confidence interval for survival function 


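A confidence interval for $S(t|x)$ can be calculated in the following ways. Let

$$Y(t) = \begin{cases} \hat{S}(t|x) & \text{original} \\ \ln \hat{S}(t|x) & \text{log} \\ \ln\left( -\ln \hat{S}(t|x) \right) & \text{log-log} \end{cases}$$

The confidence interval for $S(t|x)$ at the $1 - \alpha$ level is

$$\begin{cases} Y(t) \pm z_{\alpha/2} \sqrt{\hat{V}(Y(t))} = \hat{S}(t|x) \pm z_{\alpha/2} \sqrt{\hat{V}(Y(t))} & \text{original} \\[1ex] \exp\left( Y(t) \pm z_{\alpha/2} \sqrt{\hat{V}(Y(t))} \right) = \hat{S}(t|x) \exp\left( \pm z_{\alpha/2} \sqrt{\hat{V}(Y(t))} \right) & \text{log} \\[1ex] \exp\left\{ -\exp\left( Y(t) \pm z_{\alpha/2} \sqrt{\hat{V}(Y(t))} \right) \right\} = \left( \hat{S}(t|x) \right)^{\exp\left( \pm z_{\alpha/2} \sqrt{\hat{V}(Y(t))} \right)} & \text{log-log} \end{cases}$$

where $z_{\alpha/2}$ is the $1 - \alpha/2$ upper percentile point of the standard normal distribution and

$$\hat{V}(Y(t)) = \begin{cases} \left( \hat{S}(t|x) \right)^2 \hat{V}\left( \hat{H}(t|x) \right) & \text{original} \\[1ex] \hat{V}\left( \hat{H}(t|x) \right) & \text{log} \\[1ex] \left( \hat{H}(t|x) \right)^{-2} \hat{V}\left( \hat{H}(t|x) \right) & \text{log-log} \end{cases}$$

Please note that the first two confidence intervals may have values greater than 1 or less than zero
(we can truncate them to 0 or 1 if they are out of range). The third one is always between 0 and 1.
However, Link (1984 & 1986) suggested that the second one performed the best.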
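

The three interval types can be written compactly once $\hat{S}(t|x)$ and $\hat{V}(\hat{H}(t|x))$ are available. The function name `survival_ci` and its arguments are illustrative assumptions, and the truncation to [0, 1] is the optional step mentioned above.

```python
import math
from statistics import NormalDist


def survival_ci(s_hat, var_h, method="log", alpha=0.05):
    """Confidence limits for S(t|x) given s_hat in (0, 1) and Var(H_hat)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    h_hat = -math.log(s_hat)
    if method == "original":
        half = z * math.sqrt(s_hat ** 2 * var_h)
        lower, upper = s_hat - half, s_hat + half
    elif method == "log":
        half = z * math.sqrt(var_h)
        lower, upper = s_hat * math.exp(-half), s_hat * math.exp(half)
    elif method == "loglog":
        half = z * math.sqrt(var_h / h_hat ** 2)
        # s_hat < 1, so the larger exponent gives the lower limit.
        lower, upper = s_hat ** math.exp(half), s_hat ** math.exp(-half)
    else:
        raise ValueError("method must be 'original', 'log', or 'loglog'")
    # Truncate limits that fall outside [0, 1].
    return max(0.0, lower), min(1.0, upper)
```

For example, `survival_ci(0.95, 0.04, "original")` produces an upper limit that is truncated to 1, whereas the log-log interval for the same inputs stays inside the unit interval without truncation.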


Residuals 
Some residuals defined below depend on the baseline cumulative hazard function. Three estimation
methods for the baseline cumulative hazard function are available to the user. If the user does not request estimation
of the cumulative hazard or survival function but does request residuals, then the Breslow estimate is used
if the Breslow approximation is chosen in estimating the parameters, and the Efron estimate is used if the Efron
approximation is chosen in estimating the parameters.


Schoenfeld’s partial residuals 


This is calculated only for observations with 6; = 1. 


roy = wu; (B, Iai) 


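where $u_i(\cdot)$ is defined in “Pseudo Partial Likelihood and Derivatives”.


Martingale residual

$$r_i^{(M)} = \delta_i - \left( \hat{H}_0(t_{2i}) - \hat{H}_0(t_{1i}) \right) \exp(x_i'\hat{B})$$


Deviance residual

$$r_i^{(D)} = \mathrm{sign}\left( r_i^{(M)} \right) \sqrt{2 \left[ -r_i^{(M)} - \delta_i \ln\left( \delta_i - r_i^{(M)} \right) \right]}$$


Cox-Snell residual

$$r_i^{(CS)} = \left( \hat{H}_0(t_{2i}) - \hat{H}_0(t_{1i}) \right) \exp(x_i'\hat{B}) = \delta_i - r_i^{(M)}$$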
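

The three residuals above are simple functions of the baseline cumulative hazard at the record's entry and end times. The sketch below assumes those two values and the linear predictor are already computed; `hazard_residuals` is a hypothetical name, and the term $\delta_i \ln(\cdot)$ is taken to be zero when $\delta_i = 0$.

```python
import math


def hazard_residuals(delta, h0_t1, h0_t2, eta):
    """Return (martingale, cox_snell, deviance) for one record.

    delta: 0/1 status; h0_t1, h0_t2: baseline cumulative hazard estimates at
    the entry and end times; eta: linear predictor for the record.
    """
    cox_snell = (h0_t2 - h0_t1) * math.exp(eta)
    martingale = delta - cox_snell
    inner = -martingale
    if delta:
        # delta - martingale equals cox_snell, assumed positive for failures.
        inner -= math.log(cox_snell)
    # Guard against tiny negative values caused by rounding.
    deviance = math.copysign(math.sqrt(max(0.0, 2.0 * inner)), martingale)
    return martingale, cox_snell, deviance
```

For a censored record the deviance residual reduces to $-\sqrt{2 \, r_i^{(CS)}}$, which gives a quick check of the sign convention.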


Score residual 


2) = wiU; (B) 


where $U_i(\beta)$ is defined in “Properties of Estimates”.


DFBETA 


DFBETA that measures the influence of record i on the parameter estimate is

$$w_i \left( J_S(\hat{B}) \right)^{-1} U_i(\hat{B})$$

This is approximately the parameter change, $\hat{B} - \hat{B}_{(i)}$, where $\hat{B}_{(i)}$ is the parameter estimate
when the ith record is omitted.


Aggregated residual 


When there are multiple records representing a single subject (as in data with two time variables),
residuals can be given for each subject rather than for each record. Except for Schoenfeld’s and
deviance residuals, the aggregated residual for a subject is simply the sum of the corresponding
record residuals over all the records belonging to the same subject. Please notice that aggregation
can only be done for data in the format $\{id_i, t_{1i}, t_{2i}, \delta_i, x_i, w_i\}_{i=1}^{n}$. For Schoenfeld’s residual,
the aggregated version is the same as that of the non-aggregated version because Schoenfeld’s
residual is only defined for records with $\delta_i = 1$. For deviance residual, the aggregated residual can
be derived using the aggregated Martingale residual.


Baseline Hazard Strata


Cox regression can be extended to allow multiple baseline hazard strata (note that these are
different from the sample design strata). The baseline hazard strata divide the subjects into disjoint
groups, each of which has a different baseline hazard function, while the regression parameter
$\beta$ stays the same for all baseline hazard strata.


Suppose there are G baseline hazard strata. For baseline hazard stratum g, the model becomes

$$h_g(t|x) = h_{0g}(t) \exp\{x'\beta\}$$

Let $V_g$ be the set of records belonging to baseline hazard stratum g. Adding the subscript g to a
quantity denotes that it is calculated only using data in $V_g$. For baseline stratum g, the previously
defined quantities would be $\{E_g^{(a)}(\beta, t)\}_{a=0}^{2}$, $\{EE_g^{(a)}(\beta, t, r)\}_{a=0}^{2}$, $\bar{x}_g(\beta, t)$, $u_{gi}(\beta, t)$, $J_{gi}(\beta, t)$,
$l_{Sg}(\beta)$, $U_{Sg}(\beta)$, $J_{Sg}(\beta)$, $U_{gi}(\beta)$.


The overall pseudo partial likelihood, its first and second derivatives become

$$l_S(\beta) = \sum_{g=1}^{G} l_{Sg}(\beta), \qquad U_S(\beta) = \sum_{g=1}^{G} U_{Sg}(\beta), \qquad J_S(\beta) = \sum_{g=1}^{G} J_{Sg}(\beta)$$

The parameter B can be estimated by maximizing $l_S(\beta)$ as before. The variance of the parameter
estimates and design effects are calculated by the same formulae with the following modifications:

$$U_j^{(s)}(\beta) = \sum_{i \in \{id_i = j\}} U_{k_i i}(\beta)$$

where $k_i$ is the baseline stratum that case i belongs to, and the sum is over all cases for subject j,
no matter which baseline stratum the case is in.

After the regression parameters are estimated, the cumulative hazard and survival functions can
be estimated for each baseline stratum separately using the same formula but on data only from
that stratum. Let $\hat{H}_g(t|x)$ denote the estimate of stratum g’s cumulative hazard function at time t
for a given predictor x. Its variance calculation is similar to before but with the following changes.

$$\hat{V}\left( \hat{H}_g(t|x) \right) \approx \widehat{Var}\left( \sum_{j=1}^{n_s} w_j q_{g,j}^{(s)}(t|x) \right)$$

where

$$q_{g,j}^{(s)}(t|x) = \sum_{i \in \{id_i = j\}} q_{g,k_i i}(t|x)$$

$$q_{g,k_i i}(t|x) = u_{gi}(t|x) \, I(i \in V_g) - \left( A_g(t|x) \right)' \left( J_S(\hat{B}) \right)^{-1} U_{k_i i}(\hat{B})$$

and $u_{gi}(t|x)$, $A_g(t|x)$ are calculated by the same equations as before but only using data from
stratum g.


Given regression parameters at the estimated values, the residual for each record is calculated
based on the data only from the stratum that the record belongs to. If record i belongs to stratum g,
then in its residuals calculation, simply replace $u_i$, $H_0$, $U_i$ by $u_{gi}$, $H_{0g}$, $U_{gi}$.


Time-Dependent Predictors 


Cox regression can also be extended to allow time dependent predictors, x = x(t). The Cox
regression model becomes

$$h(t|x(t)) = h_0(t) \exp\{x(t)'\beta\}$$

The previously defined equations still apply by simply replacing x with x(t) accordingly.


Note: If the values of a time-dependent predictor only depend on time and not the case number, 
then this predictor will be absorbed in the baseline hazard function. The regression parameter 
for this predictor is set as redundant. 


Predictors 


All predictor values for records in the risk set at each failure time are needed in the calculation. 
Two kinds of time dependent predictors are allowed: piecewise constant predictors, and predictor 
values that can be calculated at all the needed times. 


Piecewise constant predictors 


Often the predictors for a subject are measured many times during the study. Between
measurements, the predictor value is assumed to be unchanged. Data with two time variables can
handle this kind of piecewise constant predictor. For each subject, multiple records with two time
variables (see “Input”) are created, one record for each distinct pattern of the time-dependent
measurements. The predictor values are constant within each record, so the problem reduces to the
case of two time variables with time-independent covariates.


Note: it is the user’s responsibility to create the data set of two time variables. 
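

One way a user might build such a data set is sketched below. The function `split_records` and its arguments are hypothetical; it assumes the measurements are sorted by time, that the first measurement is taken at the entry time, and that a failure, if any, is attached to the last interval.

```python
def split_records(subject_id, measurements, end_time, failed, weight):
    """Expand one subject into (id, t1, t2, delta, x, w) records.

    measurements: list of (time, x) pairs sorted by time; the predictor is
    held constant from each measurement until the next one.
    """
    rows = []
    for k, (start, x) in enumerate(measurements):
        if start >= end_time:
            break                      # measurement taken after follow-up ended
        if k + 1 < len(measurements):
            stop = min(measurements[k + 1][0], end_time)
        else:
            stop = end_time
        is_last = stop == end_time
        delta = 1 if (failed and is_last) else 0
        rows.append((subject_id, start, stop, delta, x, weight))
    return rows


rows = split_records("s1", [(0, 1.0), (3, 2.0), (7, 1.5)],
                     end_time=10, failed=True, weight=2.5)
# rows covers (0, 3], (3, 7], (7, 10]; only the last record has delta = 1.
```

The resulting records have disjoint intervals and share the same id and sampling weight, as the “Input” section requires.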


Calculatable predictors 


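The predictor values can be calculated and hence known at any time point; for example, the age of
a subject. The TIME PROGRAM command is used for this purpose.


Survival and Cumulative Hazard Functions


For product-limit estimate, solve for $\{\alpha_k\}_{k=1}^{K}$ from:

$$\sum_{i \in D(t_k^*)} \frac{w_i \exp\left( x_i(t_k^*)'\hat{B} \right)}{1 - \alpha_k^{\exp\left( x_i(t_k^*)'\hat{B} \right)}} = \sum_{i \in R(t_k^*)} w_i \exp\left( x_i(t_k^*)'\hat{B} \right)$$

For Breslow estimation:

$$\hat{H}_0(t) = \sum_{\{k : t_k^* \le t\}} \frac{\sum_{i \in D(t_k^*)} w_i}{\sum_{l \in R(t_k^*)} w_l \exp\left( x_l(t_k^*)'\hat{B} \right)}$$

For Efron estimation:

$$\hat{H}_0(t) = \sum_{\{k : t_k^* \le t\}} \frac{\sum_{i \in D(t_k^*)} w_i}{d(t_k^*)} \sum_{r=0}^{d(t_k^*)-1} \frac{1}{\sum_{l \in R(t_k^*)} w_l \exp\left( x_l(t_k^*)'\hat{B} \right) - \frac{r}{d(t_k^*)} \sum_{l \in D(t_k^*)} w_l \exp\left( x_l(t_k^*)'\hat{B} \right)}$$

Using the fact that $\hat{H}_0(t)$ and $\hat{S}_0(t)$ are right continuous step functions with jumps only at observed
failure times, for a given predictor path up to time T, $\{x(u) : u \le T\}$, the cumulative hazard
and survival functions are estimated by step functions. For $t \le \min(T, \max_i(t_{2i}))$,

$$\hat{H}(t \mid \{x(u) : u \le t\}) = \sum_{\{j : t_j^* \le t\}} \left( \hat{H}_0(t_j^*) - \hat{H}_0(t_{j-1}^*) \right) \exp\left( x(t_j^*)'\hat{B} \right)$$

$$\hat{S}(t \mid \{x(u) : u \le t\}) = \prod_{\{j : t_j^* \le t\}} \left( \frac{\hat{S}_0(t_j^*)}{\hat{S}_0(t_{j-1}^*)} \right)^{\exp\left( x(t_j^*)'\hat{B} \right)}$$

The variance of $\hat{H}(t \mid \{x(u) : u \le t\})$ can be calculated as in the case without time-dependent
predictors, but with the following changes:

$$u_i(t \mid \{x(u) : u \le t\}) = \frac{\delta_i I(t_{2i} \le t)}{E^{(0)}(\hat{B}, t_{2i} \mid x(t_{2i}))} - \sum_{l=1}^{n} \frac{w_l \delta_l I(t_{2l} \le t) \, Y_i(t_{2l}) \exp\left( (x_i(t_{2l}) - x(t_{2l}))'\hat{B} \right)}{\left( E^{(0)}(\hat{B}, t_{2l} \mid x(t_{2l})) \right)^2}$$

$$A(t \mid \{x(u) : u \le t\}) = -\sum_{l=1}^{n} \frac{w_l \delta_l I(t_{2l} \le t) \, E^{(1)}(\hat{B}, t_{2l} \mid x(t_{2l}))}{\left( E^{(0)}(\hat{B}, t_{2l} \mid x(t_{2l})) \right)^2}$$

$$E^{(1)}(\beta, t \mid x(t)) = \sum_{l=1}^{n} w_l Y_l(t) \left( x_l(t) - x(t) \right) \exp\left( (x_l(t) - x(t))'\beta \right)$$

There is no agreed-upon interpretation of the survival function when there are calculatable
time-dependent predictors. Survival curves based on a time-dependent covariate must be used
with extreme caution.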


Residuals 


When there are time dependent predictors, all residuals are calculated in the situation where data 
with two time variables are used to handle the time-dependent predictors. Only Schoenfeld’s 
residual, score residual, and DFBETA are calculated in other situations. 


Hypothesis Testing 


Contrasts defined as a linear combination of regression parameters can be tested. Given matrix
L with r rows and p columns, and vector K with r elements, we test the linear hypothesis
$H_0 : LB = K$ if it is testable. For more information, see the topic “Complex Samples: Model
Testing”.


Testing Model Assumptions 


Tests are performed by considering bigger alternative models involving additional parameters.
When fitting alternative models, initial values are set to 0 for all additional parameters
and $\beta = \hat{B}$ for old parameters, where $\hat{B}$ is the previously estimated value of model
$h(t|x) = h_0(t) \exp\{x'\beta\}$.

If there are baseline hazard strata or time dependent covariates in the original model, then the
alternative model should also include them. The only difference between the original and the
alternative model is that there are more predictors in the alternative model.


Testing Proportional Hazards 


A key assumption of Cox regression is proportional hazards. When predictors are constant, the
hazard ratio $h(t|x_2) / h(t|x_1) = \exp\{(x_2 - x_1)'\beta\}$ is independent of time, so the hazards at different
predictor values are proportional. We test the adequacy of the proportional hazards assumption
by considering an alternative model with time-dependent coefficients. Suppose that there are p
predictors, and we are interested in testing the proportional hazard assumption for p* predictors,
assuming the first p* predictors without loss of generality.


Specific alternative model 


Consider the alternative model

$$h(t|x) = h_0(t) \exp\{x'\beta(t)\} = h_0(t) \exp\{x'\beta + z(t)'\theta\}$$

where $z(t) = (x_1 g_1(t), \ldots, x_{p^*} g_{p^*}(t))'$ is a time dependent predictor vector, and
$g_1(t), \ldots, g_{p^*}(t)$ are p* user-specified functions of time, one for each of the predictors of interest.
This is a proportional hazards model with time dependent covariates with parameter vector
$(\beta', \theta')'$. Fit this model and test $H_0 : \theta = 0$.


For the time functions, the available options are

$$g_i(t) = \begin{cases} t & \text{identity} \\ \ln t & \text{log} \\ rd(t) & \text{rank} \\ 1 - S_{KM}(t) & \text{KM} \end{cases}$$

where $S_{KM}(t)$ is the Kaplan-Meier estimate of the survival function, and rd(t) is


1 t<tj 
rd(t)= gt [th 69) 
K+1 ar 
For simplicity, we will only allow $g_1(t) = \cdots = g_{p^*}(t) = g(t)$. By default, p* = p and
$g(t) = 1 - S_{KM}(t)$.


Note: When there are baseline strata, rd(t) and $S_{KM}(t)$ are calculated based on the whole data,
not any individual strata.


Subpopulation Estimates 


When analyses are requested for a given subpopulation, we perform calculations on the redefined
data such that if the ith record is not in the subpopulation, then

$$t_{1i} = t_{2i} = 0, \quad \delta_i = 0, \quad x_i = 0$$

In the estimations of regression parameters and the survival/cumulative hazard functions, this
substitution is equivalent to including only the subpopulation elements in the calculations. In the
calculation of variance $\tilde{V}(\beta)$ and $\tilde{V}_{SRS}(\beta)$, this means that $U_i(\beta) = 0$ if the ith record is not
in the subpopulation.


Missing Values 


List-wise deletion is used to determine which records are used in the analysis. Negative failure 
times, $t_{1i}$ or $t_{2i}$ or $t_i$, are considered missing. 


CSDESCRIPTIVES Algorithms 


This document describes the algorithms used in the complex sampling estimation procedure 
CSDESCRIPTIVES. The data do not have to be sorted. 


Complex sample data must contain both the values of the variables to be analyzed and the 
information on the current sampling design. Sampling design includes the sampling method, strata 
and clustering information, and inclusion probabilities for all units at every sampling stage. The 
overall sampling weight must be specified for each observation. 


The sampling design specification for CSDESCRIPTIVES may include up to three stages of 
sampling. Any of the following general sampling methods may be assumed in the first stage: 
random sampling with replacement; random sampling without replacement with equal probabilities; 
and random sampling without replacement with unequal probabilities. The first two sampling 
methods can also be specified for the second and the third sampling stages. 


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


$H$          Number of strata. 
$n_h$        Sampled number of primary sampling units (PSU) per stratum. 
$f_h$        Sampling rate per stratum. 
$m_{hi}$     Number of elements in the ith sampled unit in stratum h. 
$w_{hij}$    Overall sampling weight for the jth element in the ith sampled unit in stratum h. 
$y_{hij}$    Value of variable y for the jth element in the ith sampled unit in stratum h. 
$Y$          Population total sum for variable y. 
$n$          Total number of elements in the sample. 
$N$          Total number of elements in the population. 


Weights 


Overall weights specified for each ultimate element are processed as given. See “Weights” in 
“Complex Samples: Covariance Matrix of Total” for more information on weights and variance 
estimation methods. 


Z Expressions 


$$z_{hij} = w_{hij}\, y_{hij}$$

$$z_{hi} = \sum_{j=1}^{m_{hi}} z_{hij}$$

$$\bar{z}_h = \frac{1}{n_h}\sum_{i=1}^{n_h} z_{hi}$$

$$S_h^2 = \frac{1}{n_h - 1}\sum_{i=1}^{n_h} (z_{hi} - \bar{z}_h)^2$$


For multi-stage samples, the index h denotes a stratum in the given stage, and i stands for a unit 
from stratum h in the same stage. The index j runs over all final stage elements contained in unit hi. 


Variable Total 


An estimate for the population total of variable y in a single-stage sample is the weighted sum 
over all the strata and all the clusters: 


$$\hat{Y} = \sum_{h=1}^{H}\sum_{i=1}^{n_h}\sum_{j=1}^{m_{hi}} w_{hij}\, y_{hij}$$

Alternatively, compute the weighted sum over all the elements in the sample: 


$$\hat{Y} = \sum_{i=1}^{n} w_i\, y_i$$


The latter expression is more general because it also applies to multi-stage samples. 
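
As a minimal sketch of the element-level form of the estimate, the weighted sum can be written in a few lines of Python. The function name `estimate_total` and the flat list layout are illustrative assumptions, not part of the procedure. 

```python
def estimate_total(weights, values):
    """Estimate a population total as the sum of w_i * y_i over sample elements.

    `weights` holds the overall sampling weights w_i and `values` the observed
    y_i, in the same element order (hypothetical flat layout for illustration).
    """
    if len(weights) != len(values):
        raise ValueError("weights and values must have the same length")
    return sum(w * y for w, y in zip(weights, values))
```

Because the sum runs over elements rather than over strata and clusters, the same function applies whatever the number of sampling stages. 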


Variable Total Variance 


For a multi-stage sample containing a with-replacement sampling stage, all specifications other 
than weights are ignored for the subsequent stages. They make no contribution to the variance 
estimates. 


Single Stage Sample 


The variance of the total for variable y in single-stage sampling is estimated by the following: 


$$\hat{V}(\hat{Y}) = \hat{V}_1(\hat{Y}) = \sum_{h=1}^{H} U_h$$


where $U_h$ is an estimated contribution from stratum h and depends on the sampling method 
as follows: 


■ For sampling with replacement: $U_h = n_h S_h^2$ 
■ For simple random sampling: $U_h = (1 - f_h)\, n_h S_h^2$ 
■ For sampling without replacement and unequal probabilities: 

$$U_h = \sum_{i=1}^{n_h}\sum_{j>i}^{n_h} \left(\frac{\pi_{hi}\,\pi_{hj}}{\pi_{hij}} - 1\right) (z_{hi} - z_{hj})^2$$

$\pi_{hi}$ and $\pi_{hj}$ are the inclusion probabilities for units i and j in stratum h, and $\pi_{hij}$ is the joint 
inclusion probability for the same units. This estimator is due to Yates and Grundy (1953) and 
Sen (1953). 


For each stratum h containing a single element, the variance contribution $U_h$ is always set to zero. 
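
The two equal-probability cases can be sketched as follows. This is an illustration only: it assumes that $S_h^2$ is the usual sample variance of the weighted unit totals $z_{hi}$ within the stratum, and the function names and the `"wr"`/`"srs"` method labels are hypothetical. 

```python
def stratum_variance_contribution(psu_totals, method, sampling_rate=0.0):
    """Return U_h for one stratum from its weighted PSU totals z_hi.

    method: "wr" (with replacement) or "srs" (simple random sampling without
    replacement); both labels are illustrative, not procedure keywords.
    """
    n_h = len(psu_totals)
    if n_h < 2:
        # A stratum with a single unit contributes zero.
        return 0.0
    mean = sum(psu_totals) / n_h
    # Assumed definition: S_h^2 is the sample variance of the z_hi.
    s2 = sum((z - mean) ** 2 for z in psu_totals) / (n_h - 1)
    if method == "wr":
        return n_h * s2
    if method == "srs":
        return (1.0 - sampling_rate) * n_h * s2
    raise ValueError("unknown sampling method: %r" % method)


def single_stage_variance(strata):
    """Sum U_h over strata; each entry is (psu_totals, method, sampling_rate)."""
    return sum(stratum_variance_contribution(z, m, f) for z, m, f in strata)
```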


Two-stage Sample 


When the sample is obtained in two stages and sampling without replacement is applied in the 
first stage, use the following estimate for the variance of the total for variable y: 


$$\hat{V}(\hat{Y}) = \hat{V}_2(\hat{Y}) = \hat{V}_1(\hat{Y}) + \sum_{h=1}^{H}\sum_{i=1}^{n_h} \pi_{hi} \sum_{k=1}^{K_{hi}} U_{hik}$$

where 


■ $\pi_{hi}$ is the first stage inclusion probability for the primary sampling unit i in stratum h. In 
the case of simple random sampling, the inclusion probability is equal to the sampling rate 
$f_h$ for stratum h. 
■ $K_{hi}$ is the number of second stage strata in the primary sampling unit i within the first stage 
stratum h. 
■ $U_{hik}$ is a variance contribution from the second stage stratum k from the primary sampling 
unit hi. Its value depends on the second stage sampling method; the corresponding formula 
from “Single Stage Sample” applies. 


Three-stage Sample 


When the sample is obtained in three stages where sampling in the first stage is done without 
replacement and simple random sampling is applied in the second stage, we use the following 
estimate for the variance of the total for variable y: 


$$\hat{V}(\hat{Y}) = \hat{V}_3(\hat{Y}) = \hat{V}_2(\hat{Y}) + \sum_{h=1}^{H}\sum_{i=1}^{n_h} \pi_{hi} \sum_{k=1}^{K_{hi}} f_{hik} \sum_{j=1}^{n_{hik}}\sum_{l=1}^{L_{hikj}} U_{hikjl}$$

where 


■ $f_{hik}$ is the sampling rate for the secondary sampling units in the second stage stratum hik. 
■ $L_{hikj}$ is the number of third stage strata in the secondary sampling unit hikj. 
■ $U_{hikjl}$ is a variance contribution from the third stage stratum l contained in the secondary 
sampling unit hikj. Its value depends on the third stage sampling method; the corresponding 
formula from “Single Stage Sample” applies. 


Population Size Estimation 


An estimate for the population size corresponds to the estimate for the variable total; it is the sum of 
the sampling weights. We have the following estimate for single-stage samples: 

$$\hat{N} = \sum_{h=1}^{H}\sum_{i=1}^{n_h}\sum_{j=1}^{m_{hi}} w_{hij}$$

More generally, 

$$\hat{N} = \sum_{i=1}^{n} w_i$$

The variance of $\hat{N}$ is obtained by replacing $y_{hij}$ with 1; that is, by replacing $z_{hij}$ with $w_{hij}$ in the 
corresponding variance estimator formula for $\hat{V}(\hat{Y})$. 


Ratio Estimation 
Let R = Y/X be the ratio of the totals for variables y and x. It is estimated by 

$$\hat{R} = \hat{Y}/\hat{X}$$

where $\hat{Y}$ and $\hat{X}$ are the estimates for the corresponding variable totals. 


The variance of $\hat{R}$ is approximated using the Taylor linearization formula following Woodruff 
(1971). The estimate for the approximate variance of the ratio estimate, $\hat{V}(\hat{R})$, is obtained by 
replacing $z_{hij}$ with 

$$z_{hij} = w_{hij}\,(y_{hij} - \hat{R}\, x_{hij})/\hat{X}$$

in the corresponding variance estimator $\hat{V}(\hat{Y})$. 
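
The linearization step can be illustrated with a short Python sketch. The function name and the flat per-element lists are assumptions made for the example; the returned z values would then be passed to whichever variance estimator for a total matches the design. 

```python
def ratio_linearized(weights, y, x):
    """Return the ratio estimate R = Y/X and the linearized z values.

    Each z_i = w_i * (y_i - R * x_i) / X replaces the weighted value w_i * y_i
    in the variance formula for a total (illustrative helper, flat layout).
    """
    y_total = sum(w * yi for w, yi in zip(weights, y))
    x_total = sum(w * xi for w, xi in zip(weights, x))
    ratio = y_total / x_total
    z = [w * (yi - ratio * xi) / x_total for w, yi, xi in zip(weights, y, x)]
    return ratio, z
```

By construction the linearized values sum to zero over the sample, which is a convenient check on an implementation. 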


Mean Estimation 
The mean $\bar{Y}$ for the variable y is estimated by 

$$\hat{\bar{Y}} = \hat{Y}/\hat{N}$$

where $\hat{Y}$ is the estimate for the total of y and $\hat{N}$ is the population size estimate. 


The variance of the mean is estimated using the ratio formulas, as the mean is a ratio of $\hat{Y}$ and 
$\hat{N}$. Accordingly, $\hat{V}(\hat{\bar{Y}})$ is obtained by substituting $z_{hij}$ with 

$$z_{hij} = w_{hij}\,(y_{hij} - \hat{\bar{Y}})/\hat{N}$$

in the corresponding variance estimator $\hat{V}(\hat{Y})$. 


Domain Estimation 


Let the population be divided into D domains. For each domain d define the following indicator 
variables: 


$$\delta_{hij}(d) = \begin{cases} 1 & \text{if the sample unit } hij \text{ is in the domain } d \\ 0 & \text{otherwise} \end{cases}$$


To estimate a domain population total, domain variable total, ratios and means, substitute $y_i$ with 
$\delta_i(d)\, y_i$ in the corresponding formula for the whole population as follows: 


■ Domain variable total: $\hat{Y}_d = \sum_{i=1}^{n} w_i\, \delta_i(d)\, y_i$ 
■ Domain population total: $\hat{N}_d = \sum_{i=1}^{n} w_i\, \delta_i(d)$ 
■ Domain variable ratio: $\hat{R}_d = \hat{Y}_d/\hat{X}_d$ 
■ Domain variable mean: $\hat{\bar{Y}}_d = \hat{Y}_d/\hat{N}_d$ 


Similarly, in order to estimate the variances of the above estimators, substitute $y_{hij}$ with 
$\delta_{hij}(d)\, y_{hij}$ in the corresponding formula for the whole population. The following substitutions of 
$z_{hij}$ in the formulas for $\hat{V}(\hat{Y})$ are used for estimating the variance of: 


■ Domain variable total: $z_{hij}(d) = \delta_{hij}(d)\, w_{hij}\, y_{hij}$ 
■ Domain population total: $z_{hij}(d) = \delta_{hij}(d)\, w_{hij}$ 
■ Domain variable ratio: $z_{hij}(d) = \delta_{hij}(d)\, w_{hij}\,(y_{hij} - \hat{R}_d\, x_{hij})/\hat{X}_d$ 
■ Domain mean: $z_{hij}(d) = \delta_{hij}(d)\, w_{hij}\,(y_{hij} - \hat{\bar{Y}}_d)/\hat{N}_d$ 
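
A small Python sketch of the indicator substitution for point estimates follows. The function name and the boolean membership list are illustrative assumptions. Note that every sample element stays in the loop and elements outside the domain simply contribute zero. 

```python
def domain_estimates(weights, y, in_domain):
    """Return (domain total, domain size, domain mean) via the 0/1 indicator.

    `in_domain` is a list of booleans playing the role of the domain
    indicator for each sample element (hypothetical flat layout).
    """
    total = 0
    size = 0
    for w, yi, flag in zip(weights, y, in_domain):
        delta = 1 if flag else 0
        total += w * delta * yi
        size += w * delta
    mean = total / size  # undefined when no sampled element falls in the domain
    return total, size, mean
```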


Standard Errors 


Let Z denote any of the population or subpopulation quantities defined above: variable total, 
population size, ratio or mean. Then the standard error of an estimator $\hat{Z}$ is the square root of its 
estimated variance: 

$$StdError(\hat{Z}) = \sqrt{\hat{V}(\hat{Z})}$$


Coefficient of Variation 


The coefficient of variation of the estimator $\hat{Z}$ is the ratio of its standard error to its value: 

$$CV(\hat{Z}) = \frac{StdError(\hat{Z})}{\hat{Z}}$$

The coefficient of variation is undefined when $\hat{Z} = 0$. 


T Tests 


Testing the hypothesis that a population quantity Z equals $\theta_0$; that is, $H_0: Z = \theta_0$, is performed 
using the t test statistic: 

$$t(\hat{Z}) = \frac{\hat{Z} - \theta_0}{StdError(\hat{Z})}$$

The p-value for the two-sided test is given by the probability 

$$P\left(|T| > |t(\hat{Z})|\right)$$

where T is a random variable from the t distribution with df degrees of freedom. 


The number of the degrees of freedom is calculated as the difference between the number of 
primary sampling units and the number of strata in the first stage of sampling. 


Confidence Limits 


A level 1−α confidence interval is constructed for a given 0 < α < 1. The confidence bounds are 
defined as 

$$\hat{Z} \pm StdError(\hat{Z})\; t_{df}(1 - \alpha/2)$$

where $StdError(\hat{Z})$ is the estimated standard error of $\hat{Z}$, and $t_{df}(1 - \alpha/2)$ is the 
100(1 − α/2) percentile of the t distribution with df degrees of freedom. 


Design Effects 


The design effect Deff is estimated by 

$$Deff = \frac{\hat{V}(\hat{Y})}{\hat{V}_{srs}(\hat{Y})}$$

$\hat{V}(\hat{Y})$ is the estimate of the variance of $\hat{Y}$ under the appropriate sampling design, while 
$\hat{V}_{srs}(\hat{Y})$ is the estimate of the variance of $\hat{Y}$ under the simple random sampling assumption 
as follows: 

$$\hat{V}_{srs}(\hat{Y}) = (fpc)\,\frac{\hat{N}}{n-1}\sum_{i=1}^{n} w_i \left(y_i - \hat{Y}/\hat{N}\right)^2$$

Assuming sampling without replacement we have $fpc = 1 - n/\hat{N}$ given that $n/\hat{N} < 1$, while for 
sampling with replacement we set fpc = 1. This assumption is independent of the sampling 
specified for the complex sample design based variance $\hat{V}(\hat{Y})$. 


Whereas the design effect is not relevant for estimates of the population size, we do compute 
design effects for ratios and means in addition to the totals. The values of variable y in $\hat{V}_{srs}$ are 
then replaced by the linearized values as follows: 


■ Ratio estimation: $(y_i - \hat{R}\, x_i)/\hat{X}$ 
■ Mean estimation: $(y_i - \hat{\bar{Y}})/\hat{N}$ 


When estimating design effects for domains we use the familiar substitution $\delta_i(d)\, y_i$ for $y_i$ in the 
$\hat{V}_{srs}$ formula in addition to any ratio or mean substitutions. 


We also provide the square root of the design effect, $\sqrt{Deff}$. 
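
The simple random sampling reference variance for a total can be sketched in Python as below. The sketch assumes the form fpc · N̂/(n − 1) · Σ w_i (y_i − Ŷ/N̂)², with N̂ taken as the sum of the weights and n/N̂ < 1; the function names are illustrative. 

```python
def srs_variance_of_total(weights, values, with_replacement=False):
    """Variance of the estimated total under a simple random sampling assumption.

    Assumes n / N_hat < 1 when sampling without replacement, where N_hat is
    the sum of the weights (assumption stated for this sketch).
    """
    n = len(values)
    n_hat = sum(weights)
    mean = sum(w * y for w, y in zip(weights, values)) / n_hat
    fpc = 1.0 if with_replacement else 1.0 - n / n_hat
    spread = sum(w * (y - mean) ** 2 for w, y in zip(weights, values))
    return fpc * n_hat / (n - 1) * spread


def design_effect(design_variance, srs_variance):
    """Ratio of the design-based variance to the SRS reference variance."""
    return design_variance / srs_variance
```

With equal weights w = N/n the function reduces to the textbook expression N²(1 − n/N)s²/n, which gives a quick sanity check. 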


Design effects and their applications have been discussed by Kish (1965) and Kish (1995). 


CSGLM Algorithms 


CSGLM is a procedure for regression analysis as well as analysis of variance and covariance 
based on complex samples. 


Complex sample data must contain both the values of the variables to be analyzed and the 
information on the current sampling design. Sampling design includes the sampling method, strata 
and clustering information, inclusion probabilities and the overall sampling weights. 


The sampling design specification for CSGLM may include up to three stages of sampling. Any of the 
following general sampling methods may be assumed in the first stage: random sampling with 
replacement; random sampling without replacement with equal probabilities; and random sampling 
without replacement with unequal probabilities. The first two sampling methods can also be 
specified for the second and the third sampling stages. 


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


$n$    Total number of elements in the sample. 
$p$    Number of regression parameters in the model. 
$Y$    Dependent variable vector containing values $y_i$, i = 1, ..., n. 
$X$    n×p design matrix. The rows correspond to the observations and the 
       columns to the model parameters. The ith row is $x_i'$, i = 1, ..., n. 
$W$    Diagonal matrix with sampling weights $w_i$, i = 1, ..., n on the diagonal. 
$B$    Vector of p unknown population parameters. 
$N$    Total number of elements in the population. 


Weights 


Overall weights specified for each ultimate element are processed as given. See “Weights” in 
“Complex Samples: Covariance Matrix of Total” for more information on weights and variance 
estimation methods. 


Model Specification 


Let the linear model be specified by the equation Y=XB+E, where Y is a vector of observed 
dependent variable values, X is the linear model design matrix, B is a vector of model parameters 
and E is a vector of random errors with zero mean. Each column of the design matrix corresponds 
to a parameter in the model equation. Each parameter corresponds to one of the intercept, factor 
main effects, factor interaction effects, factor nested effects, covariate effects and factors by 
covariates interaction effects. For every factor effect level occurring in data there is a separate 
parameter. This results in an over-parametrized model. 


Estimation 


Assuming that the entire finite population has been observed, we can obtain the least squares 
parameter estimates for the linear model by solving the following normal equations 

$$X_N' X_N B = X_N' Y_N$$

where $X_N$ and $Y_N$ denote the design matrix and dependent variable for all elements in the given 
population. A solution vector for this system, estimating the model parameters, is denoted by 
B. In our analyses we take the established design-based approach concerned with estimating the 
finite population parameters B developed by Kish and Frankel (1974), Fuller (1975), Shah, Holt 
and Folsom (1977) and others. See Särndal et al. (1992) for an overview. 


Estimates for the population matrices $X_N' X_N$ and $X_N' Y_N$ are given by $X'WX$ and 
$X'WY$ respectively. We solve the following set of weighted normal equations 

$$X'WX\,\hat{B} = X'WY$$

where W is a diagonal matrix with sampling weights $w_i$, i = 1, ..., n on the diagonal. A solution 
for $\hat{B}$ is then given by the equation 

$$\hat{B} = (X'WX)^{-} X'WY$$

where $(X'WX)^{-}$ is a generalized g2 inverse of $X'WX$. 


Predicted Values and Residuals 
Predicted values for each observation are given by $\hat{y}_i = x_i'\hat{B}$. 


The vector of residuals r is defined by $r_i = y_i - \hat{y}_i$, i = 1, ..., n. 


The residual sum of squares is: 

$$r'Wr = \sum_{i=1}^{n} w_i\,(y_i - x_i'\hat{B})^2$$

Algorithm 


Estimation begins with construction of the weighted sum-of-squares and cross-products (SSCP) 
matrix. Let $z_i' = (x_i', y_i)$ be the ith row of the matrix Z. Then the SSCP matrix is computed by 

$$Z'WZ = \sum_{i=1}^{n} w_i\, z_i z_i'$$

where $z_i z_i'$ is the outer product for the vector $z_i$. This matrix can be partitioned as follows 

$$Z'WZ = \begin{pmatrix} X'WX & X'WY \\ Y'WX & Y'WY \end{pmatrix}$$

After applying the sweep operator to the first p rows and columns of the matrix above, we obtain 
the following solution matrix 

$$\begin{pmatrix} -(X'WX)^{-} & \hat{B} \\ \hat{B}' & r'Wr \end{pmatrix}$$

where $(X'WX)^{-}$ is a generalized g2 inverse of $X'WX$, $\hat{B}$ is a parameter solution, and $r'Wr$ is the 
residual sum of squares. 


When a column of $X'WX$ is found to be dependent on previous columns, the corresponding 
parameter is treated as redundant. The solution for redundant parameters is set to 0, as are the 
corresponding rows and columns in $(X'WX)^{-}$. 
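
The sweep-based solution can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the procedure's implementation: it uses the symmetric form of the sweep operator (pivot replaced by −1/d), a hypothetical absolute tolerance `tol` for detecting redundant parameters, and illustrative function names. 

```python
def sweep(a, k):
    """Symmetric sweep of the square matrix `a` (list of lists) on pivot k."""
    d = a[k][k]
    n = len(a)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == k and j == k:
                out[i][j] = -1.0 / d
            elif i == k or j == k:
                out[i][j] = a[i][j] / d
            else:
                out[i][j] = a[i][j] - a[i][k] * a[k][j] / d
    return out


def weighted_sscp(rows, weights):
    """Accumulate the weighted SSCP matrix, sum of w_i * z_i z_i'."""
    m = len(rows[0])
    s = [[0.0] * m for _ in range(m)]
    for z, w in zip(rows, weights):
        for i in range(m):
            for j in range(m):
                s[i][j] += w * z[i] * z[j]
    return s


def solve_by_sweep(x_rows, y, weights, tol=1e-10):
    """Return (parameter solution, residual sum of squares)."""
    rows = [list(x) + [yi] for x, yi in zip(x_rows, y)]
    a = weighted_sscp(rows, weights)
    p = len(x_rows[0])
    for k in range(p):
        if a[k][k] <= tol:
            # Redundant parameter: zero its row and column, so its
            # solution is 0 and it drops out of the generalized inverse.
            for j in range(len(a)):
                a[k][j] = 0.0
                a[j][k] = 0.0
        else:
            a = sweep(a, k)
    beta = [a[k][p] for k in range(p)]
    return beta, a[p][p]
```

After the loop, the last column of the swept matrix holds the parameter solution and the bottom-right element holds the residual sum of squares, matching the partition described above. 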


Variance Estimates 


Variances of parameter estimates are computed according to the Taylor linearization method as 
presented by Binder (1983). 


Define the vector $d_i = x_i\,(y_i - x_i'\hat{B})$ for i = 1, ..., n and its total population estimate by 

$$\hat{d}_T = \sum_{i=1}^{n} w_i\, x_i\,(y_i - x_i'\hat{B})$$

Let $\hat{V}(\hat{d}_T)$ be its sample design-based covariance matrix. See “Complex Samples: Covariance 
Matrix of Total” for more information on its computation. Then the covariance matrix of $\hat{B}$ is 
estimated by 

$$\hat{V}(\hat{B}) = (X'WX)^{-}\, \hat{V}(\hat{d}_T)\, (X'WX)^{-}$$

Note: If any diagonal element of $\hat{V}(\hat{d}_T)$ happens to be non-positive due to the use of the 
Yates-Grundy-Sen estimator, all elements in the corresponding row and column are set to zero. 


Subpopulation Estimates 


When analyses are requested for a given subpopulation S, we redefine $(x_i', y_i)$ as follows: 

$$(x_i', y_i) = \begin{cases} (x_i', y_i) & \text{if the ith element is in } S \\ (0', 0) & \text{otherwise} \end{cases}$$

When computing point estimates, this substitution is equivalent to including only the 
subpopulation elements in the calculations. This is in contrast to computing the variance estimates 
where all elements in the sample need to be included. 


Standard Errors 


Let $\hat{B}_i$ denote a non-redundant parameter estimate. Its standard error is the square root of its 
estimated variance: 

$$SE(\hat{B}_i) = \sqrt{\hat{V}(\hat{B}_i)}$$


Standard error is undefined for redundant parameters. 


Degrees of Freedom 


The sample design degrees of freedom v is used for computing confidence intervals and test 
statistics below and is calculated as the difference between the number of primary sampling units 
and the number of strata in the first stage of sampling. Alternatively, v may be specified by the user. 


Confidence Intervals 


A level 1−α confidence interval is constructed for a given 0 < α < 1 for each non-redundant 
model parameter. Confidence bounds are given by 

$$\hat{B}_i \pm SE(\hat{B}_i)\; t_v(1 - \alpha/2)$$

where $t_v(1 - \alpha/2)$ is the 100(1 − α/2) percentile of the t distribution with v degrees of freedom. 


t Tests 


The hypothesis test $H_{0i}: B_i = 0$ is performed for each non-redundant model parameter using 
the t test statistic: 

$$t(\hat{B}_i) = \frac{\hat{B}_i}{SE(\hat{B}_i)}$$

The p-value for the two-sided test is given by the probability $P(|T| > |t(\hat{B}_i)|)$, where T is a 
random variable from the t distribution with v degrees of freedom. 


Design Effects 


The design effect for each non-redundant parameter estimate is given by 

$$Deff(\hat{B}_i) = \frac{\hat{V}(\hat{B}_i)}{\hat{V}_{srs}(\hat{B}_i)}$$

$\hat{V}(\hat{B}_i)$ is the estimate of variance of $\hat{B}_i$ under the complex sampling design, while $\hat{V}_{srs}(\hat{B}_i)$ is 
the estimate of variance of $\hat{B}_i$ under the simple random sampling assumption. The latter is 
computed as the ith diagonal element of the following matrix: 

$$\hat{V}_{srs}(\hat{B}) = (X'WX)^{-}\, \hat{V}_{srs}(\hat{d}_T)\, (X'WX)^{-}$$

where 

$$\hat{V}_{srs}(\hat{d}_T) = (fpc)\,\frac{\hat{N}}{n-p}\sum_{i=1}^{n} w_i\, d_i d_i'$$

with $d_i$ as specified earlier. 
Assuming sampling without replacement we have $fpc = 1 - n/\hat{N}$ given that $n/\hat{N} < 1$, while for 
sampling with replacement we set fpc = 1. This assumption is independent of the sampling 
specified for the complex sample design based variance $\hat{V}(\hat{d}_T)$. 


For subpopulation analysis $d_i = 0$ whenever observation i does not belong to a 
given subpopulation. 


We also provide the square root of the design effect, $\sqrt{Deff}$. 


Design effects and their application have been discussed by Kish (1965) and Kish (1995). 


Multiple R-square 


$$R^2 = 1 - \frac{r'Wr}{(Y - \hat{\bar{Y}}\,1)'\, W\, (Y - \hat{\bar{Y}}\,1)}$$

where $\hat{\bar{Y}} = \hat{Y}/\hat{N}$ is the estimated subpopulation mean for variable Y. 


If the specified model contains no intercept the following expression is used: 

$$R^2 = 1 - \frac{r'Wr}{Y'WY}$$


Hypothesis Testing 
Given an r×p matrix L and an r×1 vector K, CSGLM tests the linear hypothesis $H_0: LB = K$ if LB 
is estimable. The Wald $X^2$ statistic is given by 

$$X^2 = (L\hat{B} - K)'\,\left(L\hat{V}(\hat{B})L'\right)^{-}\,(L\hat{B} - K)$$

The statistic has an asymptotic chi-square distribution with $r_L = \mathrm{rank}(L\hat{V}(\hat{B})L')$ degrees of 
freedom. If $r_L < r$, $(L\hat{V}(\hat{B})L')^{-}$ is a generalized inverse such that Wald tests are effective 
for a restricted set of hypotheses $L_I B = K_I$ containing a particular subset I of independent rows 
from $H_0$. 


Each row $l_i'$ of L is also tested separately. The estimate for the ith row is given by $l_i'\hat{B}$ and 
its standard error by $\sqrt{l_i'\,\hat{V}(\hat{B})\,l_i}$. 


See “Complex Samples: Model Testing” for additional tests and p-value adjustments. 


Custom Tests 


Custom hypothesis tests are conducted only when L is such that LB is estimable. This condition is 
verified using the following equality: 


$$L = L\,(X'WX)^{-}\,(X'WX)$$


Default Tests of Model Effects 


For each effect specified in the model, a Type III test L matrix is constructed such that LB is 
estimable. It involves parameters only for the given effect and the containing effects and it does 
not depend on the order of effects specified in the model. If such a matrix cannot be constructed, 
the effect is not testable. K is always set to 0 when computing the test statistics for model effects. 


The hypothesis for the corrected model is that all the parameters except for the intercept are zero. 


Estimated Marginal Means 


Estimated marginal means (EMMEANS) are based on the estimated cell means. For a given fixed 
set of factors, or their interactions, we estimate marginal means as the mean value averaged over 
all cells generated by the rest of the factors in the model. Covariates may be fixed at any specified 
value. If not specified, the value for each covariate is set to its overall mean estimate. 


When missing cells are present in the data, EMMEANS may not be estimable. In such 
circumstance, we provide a modified estimate proposed by Searle, Speed and Milliken (1980) 
that ignores the non-estimable cells. 


Each marginal estimate is finally constructed in the form $l'\hat{B}$ such that $l'B$ is estimable. 


Comparing EMMEANS 


For a given factor in the model, a vector of EMMEANS is created for all levels of the factor. This 
vector can be expressed in the form $\hat{\mu} = L\hat{B}$ where each row of L is generated as described above. 
The variance is then computed by the following formula: 

$$\hat{V}(\hat{\mu}) = L\,\hat{V}(\hat{B})\,L'$$

A set of contrasts for the factor is created according to the selected contrast type. Let this set of 
contrasts define the matrix C used for testing the hypothesis $H_0: C\mu = 0$. 


The Wald $X^2$ statistic is used for testing a given set of contrasts for the factor as follows: 

$$X^2 = (C\hat{\mu})'\,\left(C\hat{V}(\hat{\mu})C'\right)^{-}\,(C\hat{\mu})$$

The statistic has an asymptotic chi-square distribution with $r_C$ degrees of freedom, where 
$r_C = \mathrm{rank}(C\hat{V}(\hat{\mu})C')$. 


Each row $c_i'$ of C is also tested separately. The estimate for the ith row is given by $c_i'\hat{\mu}$ and 
its standard error by $\sqrt{c_i'\,\hat{V}(\hat{\mu})\,c_i}$. 


See “Complex Samples: Model Testing” for additional tests and p-value adjustments. Substitute 
the following formula for the simple random sampling covariance: $\hat{V}_{srs}(\hat{\mu}) = L\,\hat{V}_{srs}(\hat{B})\,L'$. 


CSLOGISTIC Algorithms 


Logistic regression is a commonly used analytical tool for categorical responses. LOGISTIC 
REGRESSION (for binary responses) and NOMREG (for multi-category responses) are procedures 
for the standard sampling setting. This document considers the multinomial logistic regression 
model under the complex sampling setting, extending the model in NOMREG to complex 
sampling. 


There are different approaches for analytic inference in complex sampling (Chambers and Skinner 
2003). We will take the two-phase sampling and pseudo-likelihood estimation approaches. 


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


$Y$         Categorical dependent variable vector containing values $y_i$, i = 1, ..., n. 
$K$         The total number of categories for the dependent variable. 
$y_i(k)$    Indicator variable for category k; $y_i(k) = 1$ if $y_i = k$, and 0 otherwise. 
$X$         n×p design matrix. The rows correspond to the observations and the 
            columns to the model parameters. 
$\pi_i$     Inclusion probability for case i. 
$w_i$       Sampling weight for case i, $w_i = 1/\pi_i$. 
$p_x(k)$    The probability for response category k at x: $p_x(k) = \Pr(y = k \mid x)$; 
            denote $p_i(k) = p_{x_i}(k)$ for case i. 
$N$         The number of cases in the whole population. 
$n$         The number of cases in the sample. 
$B$         The parameter of interest, the population or census parameter. 


Superpopulation Model 


Two phases of sampling are assumed. The first phase generates a finite population by a model 
or super population. The second phase selects a sample according to a sampling plan from the 
finite population generated in the first phase. 


Model Generating the Population 


Assume that the response variable y at a given x follows a multinomial distribution with 
probability $p_x(k)$ for y = k. Without loss of generality, let the last category K be the reference 
category. Then for k = 1, ..., K−1, 

$$\log\frac{p_x(k)}{p_x(K)} = x'\beta_k$$

or 

$$p_x(k) = \begin{cases} \dfrac{\exp(x'\beta_k)}{1 + \sum_{k'=1}^{K-1}\exp(x'\beta_{k'})} & k = 1, \dots, K-1 \\[2ex] \dfrac{1}{1 + \sum_{k'=1}^{K-1}\exp(x'\beta_{k'})} & k = K \end{cases}$$

where $\beta_k = (\beta_{k1}, \dots, \beta_{kp})'$ is the regression parameter vector for response category k. 


There are p(K−1) regression parameters in total. This model is described in many books, for 
example Agresti (2002). 


Let B denote the MLE of the model parameter β based on the whole population. This B is also 
called the census parameter. The parameter of interest is the census parameter B, rather than 
the model parameter β. The exact definition and formulation of B are described below in the 
estimating equation. 


Parameter Estimation 


For a sample drawn from the finite population according to a sample plan, we take the 
pseudo-likelihood approach. In this approach, the pseudo-likelihood is a sample estimate 
of the population log-likelihood, and parameter estimates are derived by maximizing the 
pseudo-likelihood. 


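From the sample, an unbiased estimate of the population log-likelihood $l_N$ is 

$$l_s(\beta) = \sum_{i \in S}\sum_{k=1}^{K} w_i\, y_i(k)\,\log(p_i(k))$$

We maximize $l_s(\beta)$ to get the estimates for the census parameter B. The pseudo-score function, 
stacking the components for k = 1, ..., K−1, is 

$$S_s(\beta) = \sum_{i \in S} w_i\,(y_i - p_i) \otimes x_i$$

where $y_i = (y_i(1), \dots, y_i(K-1))'$ and $p_i = (p_i(1), \dots, p_i(K-1))'$. The estimator obtained by 
solving $S_s(\beta) = 0$ is an estimator of B. 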


Redundant Parameters 


In this procedure, the over-parameterization approach is similar to that in the NOMREG procedure. 
If a parameter is found to be redundant, it is set to zero and will not affect the estimation procedure. 


Estimation Algorithm 


The Newton-Raphson iterative estimation method is used to solve the estimating equation. Let 
$B^{(\nu)}$ be the parameter estimate at iteration step ν; the parameter estimate $B^{(\nu+1)}$ at iteration 
step ν + 1 is updated as 

$$B^{(\nu+1)} = B^{(\nu)} - \xi\, J^{-}(B^{(\nu)})\, S_s(B^{(\nu)})$$

where 

$$J(\beta) = \frac{\partial S_s(\beta)}{\partial \beta'} = -\sum_{i \in S} w_i \left(\mathrm{diag}(p_i) - p_i p_i'\right) \otimes x_i x_i'$$

The (k, j)th block element of $J(\beta)$, for k, j = 1, ..., K−1, is 

$$\frac{\partial S_s(\beta_k)}{\partial \beta_j'} = \begin{cases} \sum_{i \in S} w_i\, p_i(k)\, p_i(j)\, x_i x_i' & k \ne j \\ -\sum_{i \in S} w_i\, p_i(k)\,(1 - p_i(k))\, x_i x_i' & k = j \end{cases}$$

$J^{-}(\beta)$ is a generalized inverse of $J(\beta)$. The stepping scalar ξ > 0 is used to make 
$l_s(B^{(\nu+1)}) > l_s(B^{(\nu)})$. Use the step-halving method if $l_s(B^{(\nu+1)}) < l_s(B^{(\nu)})$. Let t be the 
maximum number of steps in step-halving; the set of values of ξ is $\{1/2^r : r = 0, \dots, t-1\}$. 


Starting with initial values $B^{(0)}$, iteratively update $B^{(\nu+1)}$ until one of the stopping criteria is 
satisfied. The final estimate is denoted as $\hat{B}$. 
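
To make the update and the step-halving rule concrete, here is a minimal Python sketch for the simplest case: a binary response (K = 2) with exactly two parameters, so that the matrix inverse can be written out by hand. The function names, the 0/1 coding of the response (1 for the non-reference category), the default limits, and the absolute log-likelihood stopping rule are all assumptions made for the illustration. 

```python
import math


def log_lik(beta, xs, ys, ws):
    """Weighted pseudo log-likelihood for a binary response (y in {0, 1})."""
    total = 0.0
    for x, y, w in zip(xs, ys, ws):
        eta = sum(b * v for b, v in zip(beta, x))
        # log(1 + exp(eta)), written to avoid overflow for large |eta|
        log1p_exp = max(eta, 0.0) + math.log1p(math.exp(-abs(eta)))
        total += w * (y * eta - log1p_exp)
    return total


def newton_step_halving(xs, ys, ws, max_iter=25, max_halvings=5, tol=1e-10):
    """Newton-Raphson with step-halving for a two-parameter binary model."""
    beta = [0.0, 0.0]
    ll = log_lik(beta, xs, ys, ws)
    for _ in range(max_iter):
        s = [0.0, 0.0]                   # pseudo-score
        h = [[0.0, 0.0], [0.0, 0.0]]     # h = -J
        for x, y, w in zip(xs, ys, ws):
            eta = beta[0] * x[0] + beta[1] * x[1]
            p = 1.0 / (1.0 + math.exp(-eta))
            for a in range(2):
                s[a] += w * (y - p) * x[a]
                for b in range(2):
                    h[a][b] += w * p * (1.0 - p) * x[a] * x[b]
        det = h[0][0] * h[1][1] - h[0][1] * h[1][0]
        # Newton direction: -J^{-1} S, which equals h^{-1} S
        d = [(h[1][1] * s[0] - h[0][1] * s[1]) / det,
             (-h[1][0] * s[0] + h[0][0] * s[1]) / det]
        for r in range(max_halvings):
            xi = 0.5 ** r                # step sizes 1, 1/2, 1/4, ...
            cand = [beta[0] + xi * d[0], beta[1] + xi * d[1]]
            cand_ll = log_lik(cand, xs, ys, ws)
            if cand_ll >= ll:
                break                    # accept the first non-decreasing step
        # If no step size helped, the smallest one tried is used.
        converged = abs(cand_ll - ll) < tol
        beta, ll = cand, cand_ll
        if converged:
            break
    return beta, ll
```

For a single binary predictor the model is saturated, so the fitted parameters can be checked by hand against the weighted log-odds in each group. 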


Note: Sometimes, infinite parameters may be present in the model because of complete or 
quasi-complete separation of the data (Albert and Anderson, 1984; Santner and Duffy, 1986). 
In CSLOGISTIC, a check for separation of the data can be performed. If either complete or 
quasi-complete separation is suggested by the test, a warning is issued and results based on the 
last iteration are given. 


Initial Values 


For all non-intercept regression parameters, set their initial values to zero. For intercepts, if 
there are any, set for k = 1, ..., K−1, 

$$B_{k1}^{(0)} = \log\left(\frac{\hat{N}_k}{\hat{N}_K}\right)$$

where $\hat{N}_k = \sum_{i \in S} w_i\, y_i(k)$ is the estimated population number of responses in category k. 
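
The intercept starting values can be computed directly from the weighted category counts, as in this short Python sketch. The function name and the 1-based integer coding of the response categories are assumptions for the example, and it presumes that every category has a positive estimated count. 

```python
import math


def initial_intercepts(weights, categories, n_categories):
    """Return log(N_k / N_K) for k = 1, ..., K-1.

    `categories` holds the observed response category of each case, coded
    1..K with K the reference category (illustrative coding).
    """
    counts = [0.0] * n_categories
    for w, c in zip(weights, categories):
        counts[c - 1] += w          # estimated population count for category c
    reference = counts[-1]
    return [math.log(counts[k] / reference) for k in range(n_categories - 1)]
```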


Stopping Criteria 


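Given two convergence criteria $\epsilon_l > 0$ and $\epsilon_p > 0$, the iteration is considered to have converged if one 
of the following criteria is satisfied: 


1. Log-likelihood convergence: 

$$\frac{|l_s(B^{(\nu+1)}) - l_s(B^{(\nu)})|}{|l_s(B^{(\nu)})| + 10^{-6}} < \epsilon_l \quad \text{if relative change}$$

$$|l_s(B^{(\nu+1)}) - l_s(B^{(\nu)})| < \epsilon_l \quad \text{if absolute change}$$

2. Parameter convergence: 

$$\max_i\left(\frac{|B_i^{(\nu+1)} - B_i^{(\nu)}|}{|B_i^{(\nu)}| + 10^{-6}}\right) < \epsilon_p \quad \text{if relative change}$$

$$\max_i\left(|B_i^{(\nu+1)} - B_i^{(\nu)}|\right) < \epsilon_p \quad \text{if absolute change}$$

3. The maximum number of iterations is reached. 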


Parameter Covariance Matrix 


The design-based variance of $\hat{B}$ (Binder 1983) has estimate 

$$\hat{V}(\hat{B}) = J^{-}(\hat{B})\, \hat{\Gamma}(\hat{B})\, J^{-}(\hat{B})$$

where $\hat{\Gamma}(\beta)$ is the estimate of the design-based variance of $S_s(\beta)$. Let $d_i = (y_i - p_i) \otimes x_i$; then 
$S_s(\beta) = \sum_{i \in S} w_i\,(y_i - p_i) \otimes x_i = \sum_{i \in S} w_i\, d_i$ is an estimate for the population total of the $d_i$ vectors. See 
“Complex Samples: Covariance Matrix of Total” for how to calculate the design-based variance 
matrix for the total. 


Confidence Intervals 


The confidence interval for a single regression parameter $B_{kj}$ is approximately 

$$\left[\hat{B}_{kj} - t_{df,1-\alpha/2}\; se(\hat{B}_{kj}),\ \ \hat{B}_{kj} + t_{df,1-\alpha/2}\; se(\hat{B}_{kj})\right]$$

where $se(\hat{B}_{kj}) = \sqrt{\hat{V}(\hat{B}_{kj})}$ is the estimated standard error of $\hat{B}_{kj}$, and $t_{df,1-\alpha/2}$ is the 
100(1 − α/2) percentile of a t distribution with df degrees of freedom. The degrees of freedom df 
can be user specified, and defaults to the difference between the number of primary sampling units 
and the number of strata in the first stage of sampling. 


Design Effect 


For each parameter $B_{kj}$, its design effect is the ratio of its variance under the design to its variance 
under the SRS design, 

$$Deff(\hat{B}_{kj}) = \frac{\hat{V}(\hat{B}_{kj})}{\hat{V}_{srs}(\hat{B}_{kj})}$$

For SRS design, the variance matrix is 

$$\hat{V}_{srs}(\hat{B}) = J^{-}(\hat{B})\, \hat{\Gamma}_{srs}(\hat{B})\, J^{-}(\hat{B})$$


where 


Assuming sampling without replacement we have $fpc = 1 - n/\hat{N}$ given that $n/\hat{N} < 1$, while for 
sampling with replacement we set fpc = 1. This assumption is independent of the sampling 
specified for the complex sample design based variance matrix $\hat{\Gamma}(\hat{B})$. 


Pseudo -2 Log-Likelihood 


For the model under consideration, the pseudo −2 Log Likelihood is 

$$-2\, l_s(\hat{B})$$

Let the initial model be the intercept-only model if the intercept is in the considered model, or the 
empty model otherwise. For the initial model, the pseudo −2 Log Likelihood is 

$$-2\, l_s(B^{(0)})$$

where $B^{(0)}$ is the initial parameter vector used in the iterative estimating procedure. 


Pseudo R Squares 


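Let $L_N(B)$ be the likelihood function for the whole population; that is, $L_N(B) = \exp(l_N(B))$. 
A sample estimate is $\hat{L}(B) = \exp(l_s(B))$. 


Cox and Snell’s R Square 

$$R^2_{CS} = 1 - \left(\frac{\hat{L}(B^{(0)})}{\hat{L}(\hat{B})}\right)^{2/N} = 1 - \exp\left\{\frac{2}{N}\left(l_s(B^{(0)}) - l_s(\hat{B})\right)\right\}$$


Nagelkerke’s R Square 

$$R^2_{N} = \frac{R^2_{CS}}{1 - \left\{\hat{L}(B^{(0)})\right\}^{2/N}}$$


McFadden’s R Square 

$$R^2_{M} = 1 - \frac{l_s(\hat{B})}{l_s(B^{(0)})}$$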
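
The three measures depend only on the two pseudo log-likelihood values and a population size, so they can be sketched in a few lines of Python. The function name is illustrative, and the argument `n_pop` stands for the population size used in the 2/N exponent (in practice possibly its estimate from the weights, which is an assumption of this sketch). Working on the log scale avoids forming the likelihoods themselves, which can underflow. 

```python
import math


def pseudo_r_squares(ll_model, ll_initial, n_pop):
    """Return (Cox and Snell, Nagelkerke, McFadden) pseudo R squares.

    ll_model:   pseudo log-likelihood at the final estimate
    ll_initial: pseudo log-likelihood of the initial model
    n_pop:      population size used in the 2/N exponent (assumed known)
    """
    cox_snell = 1.0 - math.exp((2.0 / n_pop) * (ll_initial - ll_model))
    nagelkerke = cox_snell / (1.0 - math.exp(2.0 * ll_initial / n_pop))
    mcfadden = 1.0 - ll_model / ll_initial
    return cox_snell, nagelkerke, mcfadden
```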


Hypothesis Tests 


Contrasts defined as linear combinations of regression parameters can be tested. Given an 
r×p(K−1) matrix L and an r×1 vector K, CSLOGISTIC tests the linear hypothesis $H_0: LB = K$. See 
“Complex Samples: Model Testing” for details. 


Custom Tests 


For a user-specified L and K, $H_0: LB = K$ is tested only when LB is estimable. Let 
$L = (L_1, \dots, L_{K-1})$, where each $L_k$ is an r×p matrix. LB is estimable if for every 
k = 1, ..., K−1, 

$$L_k = L_k H$$

where $H = (X'X)^{-} X'X$ is a p×p matrix. 


Note: In NOMREG, only block diagonal matrices such as $L = \mathrm{diag}(L^*, \dots, L^*)$ are considered, 
where $L^*$ is a q×p matrix. Also in NOMREG, testability is not checked. 


Default Tests of Model Effects 


For each effect specified in the model, a matrix $L = \mathrm{diag}(L^*, \dots, L^*)$ is constructed and 
$H_0: LB = 0$ is tested. The matrix $L^*$ is chosen to be the type III test matrix constructed based 
on the matrix $H = (X'X)^{-} X'X$. This construction procedure makes sure that LB is estimable. It 
involves parameters only for the given effect and the effects containing the given effect. It does 
not depend on the order of effects specified in the model. If such a matrix cannot be constructed, 
the effect is not testable. 


Predicted Values 


For a predictor pattern x, the predicted probability of each response category is 


$$\hat{p}_x(k) = \begin{cases} \dfrac{\exp(x'\hat{B}_k)}{1 + \sum_{k'=1}^{K-1}\exp(x'\hat{B}_{k'})} & k = 1, \dots, K-1 \\[2ex] \dfrac{1}{1 + \sum_{k'=1}^{K-1}\exp(x'\hat{B}_{k'})} & k = K \end{cases}$$

The predicted category c(x) is the one with the highest predicted probability; that is, 

$$c(x) = \arg\max_k\, \hat{p}_x(k)$$

Equivalently, 

$$c(x) = \arg\max_k\, (x'\hat{B}_k)$$

where $\hat{B}_K = 0$ is set for the last (reference) response category. This latter formula is less likely to 
have numerical problems and should be used. 


Classification Table 


A two-way table whose (i, j)th element is the count or the sum of weights of the observations 
whose actual response category is i (row) and whose predicted response category is j (column). 
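
The prediction rule and the weighted classification table can be sketched together in Python. The function names, the list-of-lists layout for the K−1 coefficient vectors, and the 1-based category coding are assumptions for the example. Subtracting the largest linear predictor before exponentiating leaves the probabilities unchanged and avoids overflow. 

```python
import math


def linear_predictors(x, betas):
    """Linear predictors for categories 1..K; the reference category gets 0."""
    return [sum(b * v for b, v in zip(beta, x)) for beta in betas] + [0.0]


def predict_probabilities(x, betas):
    """Predicted probability of each of the K response categories at x."""
    etas = linear_predictors(x, betas)
    m = max(etas)
    exps = [math.exp(e - m) for e in etas]
    total = sum(exps)
    return [e / total for e in exps]


def predict_category(x, betas):
    """Predicted category (1-based): argmax of the linear predictors."""
    etas = linear_predictors(x, betas)
    return max(range(len(etas)), key=etas.__getitem__) + 1


def classification_table(actual, predicted, weights, n_categories):
    """Sum of weights by (actual row, predicted column), categories coded 1..K."""
    table = [[0.0] * n_categories for _ in range(n_categories)]
    for a, p, w in zip(actual, predicted, weights):
        table[a - 1][p - 1] += w
    return table
```

Passing a weight of 1 for every observation gives the table of unweighted counts. 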


Odds Ratio 


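The ratio of odds at $\mathbf{x}_1$ to odds at $\mathbf{x}_2$ for response category $k_1$ versus $k_2$ is 

$$
or(\mathbf{x}_1, \mathbf{x}_2; k_1, k_2)
= \frac{\pi_{\mathbf{x}_1}(k_1)/\pi_{\mathbf{x}_1}(k_2)}{\pi_{\mathbf{x}_2}(k_1)/\pi_{\mathbf{x}_2}(k_2)}
= \exp\left((\mathbf{x}_1 - \mathbf{x}_2)'(\mathbf{B}_{k_1} - \mathbf{B}_{k_2})\right)
$$

For $k_1 = k$ and $k_2 = K$ (the reference response category), the odds ratio simplifies to 

$$or(\mathbf{x}_1, \mathbf{x}_2; k, K) = \exp\left((\mathbf{x}_1 - \mathbf{x}_2)'\mathbf{B}_k\right)$$

The equation for $or(\mathbf{x}_1, \mathbf{x}_2; k, K)$ is the one we use to calculate odds ratios. The estimate and 
confidence interval for $or(\mathbf{x}_1, \mathbf{x}_2; k, K)$ are respectively 

$$\exp\left((\mathbf{x}_1 - \mathbf{x}_2)'\hat{\mathbf{B}}_k\right)$$

and 

$$\left[\exp\left(c - t_{df;1-\alpha/2}\, se(c)\right),\ \exp\left(c + t_{df;1-\alpha/2}\, se(c)\right)\right]$$

where $c = (\mathbf{x}_1 - \mathbf{x}_2)'\hat{\mathbf{B}}_k$ and 

$$se(c) = \sqrt{(\mathbf{x}_1 - \mathbf{x}_2)'\,\mathrm{Var}(\hat{\mathbf{B}}_k)\,(\mathbf{x}_1 - \mathbf{x}_2)}$$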
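
As a rough illustration of the odds ratio estimate and its confidence interval, the sketch below evaluates the exponent, its standard error, and the interval for given inputs. The function name and argument layout are assumptions made here, and the t critical value is passed in by the caller because the Python standard library has no t quantile function.

```python
import math

def odds_ratio_ci(x1, x2, b_k, cov_k, t_crit):
    """Odds ratio for category k versus the reference category, with its interval.

    x1, x2 : predictor patterns (lists of length p)
    b_k    : estimated coefficient vector for category k
    cov_k  : p x p covariance matrix of b_k (list of lists)
    t_crit : t quantile at 1 - alpha/2 with df degrees of freedom (caller-supplied)
    """
    d = [a - b for a, b in zip(x1, x2)]
    c = sum(di * bi for di, bi in zip(d, b_k))           # log odds ratio
    var_c = sum(d[i] * cov_k[i][j] * d[j]
                for i in range(len(d)) for j in range(len(d)))
    se = math.sqrt(var_c)
    return math.exp(c), (math.exp(c - t_crit * se), math.exp(c + t_crit * se))

# Hypothetical numbers: the two patterns differ only in the second predictor.
estimate, (lower, upper) = odds_ratio_ci(
    [1.0, 1.0], [1.0, 0.0], [0.2, 0.7], [[0.04, 0.0], [0.0, 0.01]], 2.0)
```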


exp(B) 


$\exp(B_{kj})$ can be interpreted as an odds ratio for a main effects model. SUDAAN calls $\exp(B_{kj})$ the 
odds ratio for parameter $B_{kj}$ whether or not there is an interaction effect in the model. Even 
though they may not be odds ratios for models with interaction effects, they are still of interest. 
For each $\exp(B_{kj})$, its $1-\alpha$ confidence interval is 

$$\left(\exp\left(L(B_{kj})\right),\ \exp\left(U(B_{kj})\right)\right)$$

where $L(B_{kj})$ and $U(B_{kj})$ are the lower and upper confidence limits for census parameter $B_{kj}$. 
Subpopulation Estimates 


When analyses are requested for a given subpopulation D, we perform calculations on the 
following redefined $\mathbf{x}_i$ and $y_i(k)$: 

$$\mathbf{x}_i = \mathbf{x}_i\,\delta_i(D)$$
$$y_i(k) = y_i(k)\,\delta_i(D)$$

where 

$$
\delta_i(D) =
\begin{cases}
1 & \text{if the sample unit } i \text{ is in the subpopulation } D \\
0 & \text{otherwise}
\end{cases}
$$


When computing point estimates, this substitution is equivalent to including only the 
subpopulation elements in the calculations. This is in contrast to computing the variance estimates 
where all elements in the sample need to be included. 


Missing Values 


Missing values are handled using list-wise deletion; that is, any case without valid data on any 
design, dependent, or independent variable is excluded from the analysis. 
Complex Samples Ordinal Regression is a procedure for the analysis of ordinal responses using 
cumulative link models and allowing for both categorical and continuous predictors. Models 
specify threshold parameters associated with different response categories in addition to regression 
parameters associated with model predictors. 


Complex sample data must contain both the values of the variables to be analyzed and the 
information on the current sampling design. Sampling design includes the sampling method, strata 
and clustering information, inclusion probabilities and the overall sampling weights. 


Sampling design specification for Complex Samples Ordinal Regression may include up to three 
stages of sampling. Any of the following general sampling methods may be assumed in the first 
stage: random sampling with replacement; random sampling without replacement and with equal 
probabilities; and random sampling without replacement and with unequal probabilities. The first two 
sampling methods can also be specified for the second and the third sampling stages. 


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


$n$                Total number of complete records or cases in the dataset. 
$w_i$              Overall sampling weight for each sample element in the ith record, $i = 1, \ldots, n$. 
$K$                The number of values for the ordinal response variable, $K > 1$. 
$Y$                The ordinal response variable taking values coded into integers between 1 and K. 
$\theta$           Vector of $K-1$ population threshold parameters in the cumulative link model. 
$\beta$            Vector of p population regression parameters associated with model predictors. 
$\mathbf{B}$       Vector of all model parameters, $\mathbf{B} = (\theta^T, \beta^T)^T$. 
$\mathbf{X}$       $n \times p$ design matrix. The rows correspond to the records and the columns to the 
                   model regression parameters. The ith row is $\mathbf{x}_i^T$, $i = 1, \ldots, n$. 
$\pi_{i,k}$        Conditional response probability for category k given observed independent variable 
                   vector $\mathbf{x}_i$; that is, $\pi_{i,k} = P(Y = k \mid \mathbf{x}_i)$. 
$\gamma_{i,k}$     Conditional cumulative response probability for category k given observed 
                   independent variable vector $\mathbf{x}_i$; that is, $\gamma_{i,k} = P(Y \le k \mid \mathbf{x}_i)$. 
$N$                Total number of elements in the population, estimated by $\hat{N} = \sum_{i=1}^{n} w_i$. 


Weights 


Overall weights specified for each ultimate element are processed as given. See “Weights” in 
“Complex Samples: Covariance Matrix of Total” for more information on weights and variance 
estimation methods. 
Cumulative Link Model 


Cumulative link models support regression of a single categorical dependent variable 
on a set of categorical or continuous independent variables. The dependent variable Y 
is assumed to be ordinal. Its values have an intrinsic linear ordering and correspond to 
consecutive integers from 1 to K. The cumulative link model links the conditional cumulative 
probabilities $P(Y \le k \mid \mathbf{x}_i)$, $k = 1, \ldots, K-1$, to a linear predictor. Threshold parameters 
$\theta_1 \le \theta_2 \le \cdots \le \theta_{K-1}$ are assumed different for each cumulative probability, but the vector of 
regression parameters $\beta = (\beta_1, \ldots, \beta_p)^T$ remains the same. The cumulative link model is given 
by the following set of equations: 

$$\mathrm{link}\left(P(Y \le k \mid \mathbf{x}_i)\right) = \theta_k - \beta^T\mathbf{x}_i, \quad k = 1, \ldots, K-1$$


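The cumulative link function is specified as the inverse of a cumulative probability distribution 
function, as follows: 

$$
\mathrm{link}(\gamma_{i,k}) =
\begin{cases}
\log\left(\gamma_{i,k}/(1-\gamma_{i,k})\right) & \text{Logit link} \\
\log\left(-\log(1-\gamma_{i,k})\right) & \text{Complementary log-log link} \\
-\log\left(-\log(\gamma_{i,k})\right) & \text{Negative log-log link} \\
\Phi^{-1}(\gamma_{i,k}) & \text{Probit link} \\
\tan\left(\pi(\gamma_{i,k}-0.5)\right) & \text{Cauchit link}
\end{cases}
$$

where $\gamma_{i,k} = P(Y \le k \mid \mathbf{x}_i)$ for $k = 1, \ldots, K-1$. 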


Vector $\mathbf{x}_i$ denotes a linear model design matrix row matching the vector of regression parameters 
$\beta$. Each parameter corresponds to one of the factor main effects, factor interaction effects, factor 
nested effects, covariate effects and factors by covariates interaction effects. For every factor effect 
level occurring in data there is a separate parameter. This results in an over-parametrized model. 


Cumulative link models gained popularity after the publication by McCullagh (1980). Further 
details and examples of these models are given in Agresti (2002). 


Estimation 


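Assuming that the entire finite population $U = \{y_i, \mathbf{x}_i\}_{i=1}^{N}$ has been observed, we can obtain the 
maximum likelihood population parameter estimates for the cumulative model by maximizing the 
following multinomial log-likelihood function 

$$
\ell(\theta, \beta) = \log \prod_{i=1}^{N}\prod_{k=1}^{K} (\pi_{i,k})^{y_{i,k}}
= \sum_{i=1}^{N}\sum_{k=1}^{K} y_{i,k}\log \pi_{i,k}
$$

where we define indicator variables 

$$
y_{i,k} =
\begin{cases}
1 & \text{if } y_i = k \\
0 & \text{otherwise}
\end{cases}
$$

and model probabilities 

$$\pi_{i,k} = P(Y = k \mid \mathbf{x}_i) = \gamma_{i,k} - \gamma_{i,k-1}, \quad k = 1, \ldots, K, \quad \text{with } \gamma_{i,0} = 0,\ \gamma_{i,K} = 1.$$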
Taking the first derivatives of the log-likelihood function with respect to the model parameters 
$(\theta, \beta)$ and setting them equal to zero, we obtain a set of estimating equations for the population 
model. A solution vector for this set of equations is denoted by $(\theta_N, \beta_N)$. We follow the 
established design-based approach concerned with estimating the implicit finite population 
parameters as described by Binder (1983). Population totals in the estimating equations are 
replaced by their sample-based estimates. A solution for the sample-based estimating equations 
provides estimates for the population parameters $(\theta_N, \beta_N)$ and these are the estimates that we 
will consider in our analysis. For simplicity, we shall still denote them by $(\hat{\theta}, \hat{\beta})$. 


An equivalent approach for obtaining the estimates $(\hat{\theta}, \hat{\beta})$ is the pseudo-maximum likelihood 
method where we maximize the sample-based estimate of the log-likelihood given as follows: 

$$\hat{\ell}(\theta, \beta) = \sum_{i=1}^{n}\sum_{k=1}^{K} w_i\, y_{i,k}\log \pi_{i,k}$$

See Särndal et al. (1992) for an overview of the design-based approach in modeling survey data. 


Predicted probabilities 


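Given a predictor design vector $\mathbf{x}_i$, the model-predicted probability for each response category is 

$$\pi_{i,k} = \gamma_{i,k} - \gamma_{i,k-1}$$

$$
\gamma_{i,k} =
\begin{cases}
0 & k = 0 \\
\mathrm{link}^{-1}(\theta_k - \beta^T\mathbf{x}_i) & k = 1, \ldots, K-1 \\
1 & k = K
\end{cases}
$$

Let $\eta_{i,k} = \theta_k - \beta^T\mathbf{x}_i$. The inverse of the link function, that is, the corresponding cumulative 
distribution function, is given by the following formulas: 

$$
\mathrm{link}^{-1}(\eta_{i,k}) =
\begin{cases}
\exp(\eta_{i,k})/(1 + \exp(\eta_{i,k})) & \text{for Logit link} \\
1 - \exp(-\exp(\eta_{i,k})) & \text{for Complementary log-log link} \\
\exp(-\exp(-\eta_{i,k})) & \text{for Negative log-log link} \\
\Phi(\eta_{i,k}) & \text{for Probit link} \\
0.5 + \arctan(\eta_{i,k})/\pi & \text{for Cauchit link}
\end{cases}
$$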
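
A minimal Python sketch of the inverse link functions and the resulting category probabilities follows. The dictionary keys, function names, and example values are assumptions made for illustration; the probit case uses `statistics.NormalDist` from the standard library, and the logit form shown can overflow for very large negative predictors, which a production implementation would need to guard against.

```python
import math
from statistics import NormalDist

# Inverse link (cumulative distribution) functions keyed by a hypothetical link name.
INVERSE_LINKS = {
    "logit":   lambda eta: 1.0 / (1.0 + math.exp(-eta)),
    "cloglog": lambda eta: 1.0 - math.exp(-math.exp(eta)),
    "nloglog": lambda eta: math.exp(-math.exp(-eta)),
    "probit":  lambda eta: NormalDist().cdf(eta),
    "cauchit": lambda eta: 0.5 + math.atan(eta) / math.pi,
}

def category_probabilities(theta, beta, x, link="logit"):
    """Probabilities pi_k = gamma_k - gamma_{k-1} for k = 1..K.

    theta : K-1 ordered thresholds
    beta  : p regression parameters
    x     : p predictor values
    """
    inv = INVERSE_LINKS[link]
    xb = sum(b * v for b, v in zip(beta, x))
    # gamma_0 = 0 and gamma_K = 1 by definition.
    gamma = [0.0] + [inv(t - xb) for t in theta] + [1.0]
    return [gamma[k] - gamma[k - 1] for k in range(1, len(gamma))]

# Hypothetical example with K = 3 categories and one predictor.
results = {name: category_probabilities([-1.0, 0.5], [0.3], [2.0], name)
           for name in INVERSE_LINKS}
```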


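Estimating equations 


Sample-based estimating equations for the population parameters are given by 

$$\hat{S}(\theta, \beta) = \frac{\partial \hat{\ell}}{\partial \mathbf{B}} = \mathbf{0}$$

where 

$$
\frac{\partial \hat{\ell}}{\partial \theta_k}
= \sum_{i=1}^{n} w_i \frac{\partial \gamma_{i,k}}{\partial \eta_{i,k}}
\left(\frac{y_{i,k}}{\pi_{i,k}} - \frac{y_{i,k+1}}{\pi_{i,k+1}}\right), \quad k = 1, \ldots, K-1
$$

and 

$$
\frac{\partial \hat{\ell}}{\partial \beta_t}
= -\sum_{i=1}^{n}\sum_{k=1}^{K} w_i
\left(\frac{\partial \gamma_{i,k}}{\partial \eta_{i,k}} - \frac{\partial \gamma_{i,k-1}}{\partial \eta_{i,k-1}}\right)
\frac{y_{i,k}}{\pi_{i,k}}\, x_{it}, \quad t = 1, \ldots, p
$$

where 

$$
\frac{\partial \gamma_{i,k}}{\partial \eta_{i,k}} =
\begin{cases}
\gamma_{i,k}(1 - \gamma_{i,k}) & \text{Logit link} \\
-(1 - \gamma_{i,k})\log(1 - \gamma_{i,k}) & \text{Complementary log-log link} \\
-\gamma_{i,k}\log(\gamma_{i,k}) & \text{Negative log-log link} \\
\phi\left(\Phi^{-1}(\gamma_{i,k})\right) & \text{Probit link} \\
\cos^2\left(\pi(\gamma_{i,k} - 0.5)\right)/\pi & \text{Cauchit link}
\end{cases}
$$

for $k = 1, \ldots, K-1$, and by definition $\partial\gamma_{i,0}/\partial\eta_{i,0} = \partial\gamma_{i,K}/\partial\eta_{i,K} = 0$. Note that if 
$\gamma_{i,k} = 0$ or $\gamma_{i,k} = 1$, then $\partial\gamma_{i,k}/\partial\eta_{i,k} = 0$ for all link functions. 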


Second derivatives 


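The matrix of the first derivatives of the estimated scores $\hat{S}(\theta, \beta)$ is denoted by $\hat{J}_o(\theta, \beta)$ and its 
elements are given by the following expressions, where $\gamma'_{i,k} = \partial\gamma_{i,k}/\partial\eta_{i,k}$ and 
$\gamma''_{i,k} = \partial^2\gamma_{i,k}/\partial\eta_{i,k}^2$: 

$$
\frac{\partial^2 \hat{\ell}}{\partial\theta_{k-1}\partial\theta_k}
= \sum_{i=1}^{n} w_i\, \gamma'_{i,k-1}\gamma'_{i,k}\frac{y_{i,k}}{\pi_{i,k}^2}, \quad k = 2, \ldots, K-1
$$

$$
\frac{\partial^2 \hat{\ell}}{\partial\theta_k^2}
= \sum_{i=1}^{n} w_i\left[\gamma''_{i,k}\left(\frac{y_{i,k}}{\pi_{i,k}} - \frac{y_{i,k+1}}{\pi_{i,k+1}}\right)
- (\gamma'_{i,k})^2\left(\frac{y_{i,k}}{\pi_{i,k}^2} + \frac{y_{i,k+1}}{\pi_{i,k+1}^2}\right)\right], \quad k = 1, \ldots, K-1
$$

$$\frac{\partial^2 \hat{\ell}}{\partial\theta_j\partial\theta_k} = 0, \quad \text{for } |j - k| > 1$$

$$
\frac{\partial^2 \hat{\ell}}{\partial\theta_k\partial\beta_t}
= -\sum_{i=1}^{n} w_i\left[\gamma''_{i,k}\left(\frac{y_{i,k}}{\pi_{i,k}} - \frac{y_{i,k+1}}{\pi_{i,k+1}}\right)
- \gamma'_{i,k}\left((\gamma'_{i,k} - \gamma'_{i,k-1})\frac{y_{i,k}}{\pi_{i,k}^2}
- (\gamma'_{i,k+1} - \gamma'_{i,k})\frac{y_{i,k+1}}{\pi_{i,k+1}^2}\right)\right] x_{it},
$$
$$k = 1, \ldots, K-1, \quad t = 1, \ldots, p$$

$$
\frac{\partial^2 \hat{\ell}}{\partial\beta_t\partial\beta_u}
= \sum_{i=1}^{n}\sum_{k=1}^{K} w_i\left[\frac{\gamma''_{i,k} - \gamma''_{i,k-1}}{\pi_{i,k}}
- \frac{(\gamma'_{i,k} - \gamma'_{i,k-1})^2}{\pi_{i,k}^2}\right] y_{i,k}\, x_{it}\, x_{iu}, \quad t, u = 1, \ldots, p
$$


Second derivatives of the cumulative distribution functions are given by 

$$
\frac{\partial^2\gamma_{i,k}}{\partial\eta_{i,k}^2} = \frac{\partial\gamma_{i,k}}{\partial\eta_{i,k}} \times
\begin{cases}
1 - 2\gamma_{i,k} & \text{Logit link} \\
1 + \log(1 - \gamma_{i,k}) & \text{Complementary log-log link} \\
-(1 + \log(\gamma_{i,k})) & \text{Negative log-log link} \\
-\Phi^{-1}(\gamma_{i,k}) & \text{Probit link} \\
\sin(2\pi\gamma_{i,k}) & \text{Cauchit link}
\end{cases}
$$

for $k = 1, \ldots, K-1$, and by definition $\partial^2\gamma_{i,0}/\partial\eta_{i,0}^2 = \partial^2\gamma_{i,K}/\partial\eta_{i,K}^2 = 0$. 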
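Expected second derivatives 


The matrix of the expected first derivatives of the estimated scores $\hat{S}(\theta, \beta)$ is denoted by 
$\hat{J}_e(\theta, \beta)$ and its elements are given by the following expressions, where 
$\gamma'_{i,k} = \partial\gamma_{i,k}/\partial\eta_{i,k}$: 

$$
E\left[\frac{\partial^2 \hat{\ell}}{\partial\theta_{k-1}\partial\theta_k}\right]
= \sum_{i=1}^{n} w_i\, \gamma'_{i,k-1}\gamma'_{i,k}\frac{1}{\pi_{i,k}}, \quad k = 2, \ldots, K-1
$$

$$
E\left[\frac{\partial^2 \hat{\ell}}{\partial\theta_k^2}\right]
= -\sum_{i=1}^{n} w_i\, (\gamma'_{i,k})^2\left(\frac{1}{\pi_{i,k}} + \frac{1}{\pi_{i,k+1}}\right), \quad k = 1, \ldots, K-1
$$

$$E\left[\frac{\partial^2 \hat{\ell}}{\partial\theta_j\partial\theta_k}\right] = 0, \quad \text{for } |j - k| > 1$$

$$
E\left[\frac{\partial^2 \hat{\ell}}{\partial\theta_k\partial\beta_t}\right]
= \sum_{i=1}^{n} w_i\left[(\gamma'_{i,k} - \gamma'_{i,k-1})\frac{1}{\pi_{i,k}}
- (\gamma'_{i,k+1} - \gamma'_{i,k})\frac{1}{\pi_{i,k+1}}\right]\gamma'_{i,k}\, x_{it},
\quad k = 1, \ldots, K-1, \quad t = 1, \ldots, p
$$

$$
E\left[\frac{\partial^2 \hat{\ell}}{\partial\beta_t\partial\beta_u}\right]
= -\sum_{i=1}^{n}\sum_{k=1}^{K} w_i\, (\gamma'_{i,k} - \gamma'_{i,k-1})^2\frac{1}{\pi_{i,k}}\, x_{it}\, x_{iu},
\quad t, u = 1, \ldots, p
$$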

When conducting an analysis for a subpopulation D, only records that belong to the subpopulation 
enter the summation in all of the above derivatives formulas. 


Redundant parameters 


Due to our use of the over-parametrized model where there is a separate parameter for every 
factor effect level occurring in the data, the columns of the design matrix are often dependent. 
Collinearities among continuous variables in the data can also occur. To establish the dependencies 
in the design matrix we examine columns of $(\mathbf{1}, -\mathbf{X})^T(\mathbf{1}, -\mathbf{X})$ using the sweep operator. When a 
column is found to be dependent on previous columns, the corresponding parameter is treated as 
redundant. The solution for redundant parameters is fixed at zero. 


Parameter estimation 


The vector of estimates of the population model parameters is obtained as a solution $\hat{\mathbf{B}} = (\hat{\theta}^T, \hat{\beta}^T)^T$ of 
the sample-based estimating equations. It is computed using the Newton-Raphson method, Fisher 
scoring or a hybrid method. The hybrid method consists of applying Fisher scoring steps for a 
specified number of iterations before switching to Newton-Raphson steps. The iteration step is 
described as follows. Given a vector of parameter estimates $\mathbf{B}^{(\nu)}$ at iteration step $\nu$, the parameters 
$\mathbf{B}^{(\nu+1)}$ at iteration step $\nu + 1$ are computed by solving the following equation: 

$$\hat{J}(\mathbf{B}^{(\nu)})\,\mathbf{B}^{(\nu+1)} = \hat{J}(\mathbf{B}^{(\nu)})\,\mathbf{B}^{(\nu)} - \xi\,\hat{S}(\mathbf{B}^{(\nu)})$$

where 

$$
\hat{J}(\mathbf{B}) =
\begin{cases}
\hat{J}_o(\mathbf{B}) & \text{for Newton-Raphson step} \\
\hat{J}_e(\mathbf{B}) & \text{for Fisher scoring step}
\end{cases}
$$

The stepping scalar $\xi > 0$ is used to ensure that $\hat{\ell}(\mathbf{B}^{(\nu+1)}) > \hat{\ell}(\mathbf{B}^{(\nu)})$ and that $\pi_{i,k} > 0$ if 
$y_{i,k} = 1$ for every i. Use step-halving $\xi = 1/2^m$, $m = 0, \ldots, M-1$, until these conditions are 
satisfied or the maximum number of steps in step-halving M is reached. 


Starting with initial values $\mathbf{B}^{(0)}$, iteratively update estimates $\mathbf{B}^{(\nu+1)}$ until one of the stopping 
criteria is satisfied. The final vector of estimates is denoted by $\hat{\mathbf{B}}$. 


Initial values 


Let $\hat{N}_k = \sum_{i=1}^{n} w_i\, y_{i,k}$ be the estimated population number of responses in category $k = 1, \ldots, K$, 
and $\hat{N} = \sum_{i=1}^{n} w_i$ be the estimated population size. Initial thresholds are then computed according 
to the following formula: 

$$\theta_k^{(0)} = \mathrm{link}\left(\frac{\sum_{j=1}^{k}\hat{N}_j}{\hat{N}}\right) \quad \text{for } k = 1, \ldots, K-1$$

Initial values for all regression parameters are set to zero; that is, $\beta_t^{(0)} = 0$ for $t = 1, \ldots, p$. 
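
The initial thresholds are simply the link function applied to the weighted cumulative response proportions. The sketch below illustrates this for the logit link only; the function name and inputs are assumptions made here, and it presumes every cumulative proportion lies strictly between 0 and 1.

```python
import math

def initial_thresholds(weights, responses, K):
    """Starting thresholds under the logit link.

    weights   : overall sampling weights w_i
    responses : observed categories y_i, coded 1..K
    """
    # Estimated population count of responses in each category.
    N_k = [0.0] * K
    for w, y in zip(weights, responses):
        N_k[y - 1] += w
    N_hat = sum(N_k)

    thetas, cumulative = [], 0.0
    for k in range(K - 1):
        cumulative += N_k[k]
        g = cumulative / N_hat            # weighted cumulative proportion
        thetas.append(math.log(g / (1.0 - g)))
    return thetas

# Hypothetical data: three records, K = 3 categories.
start = initial_thresholds([1.0, 1.0, 2.0], [1, 2, 3], 3)
```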


Stopping Criteria 


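Given two convergence criteria $\epsilon_l > 0$ and $\epsilon_p > 0$, the iteration is considered to have converged if 
criterion 1 or 2 is satisfied, and it stops if any of the following criteria are satisfied: 


1. Pseudo-likelihood criterion 

$$\frac{\left|\hat{\ell}(\mathbf{B}^{(\nu+1)}) - \hat{\ell}(\mathbf{B}^{(\nu)})\right|}{\left|\hat{\ell}(\mathbf{B}^{(\nu)})\right|} < \epsilon_l \quad \text{if relative change}$$

$$\left|\hat{\ell}(\mathbf{B}^{(\nu+1)}) - \hat{\ell}(\mathbf{B}^{(\nu)})\right| < \epsilon_l \quad \text{if absolute change}$$

2. Parameter criterion 

$$\max_j\left(\frac{\left|B_j^{(\nu+1)} - B_j^{(\nu)}\right|}{\left|B_j^{(\nu)}\right|}\right) < \epsilon_p \quad \text{if relative change}$$

$$\max_j\left(\left|B_j^{(\nu+1)} - B_j^{(\nu)}\right|\right) < \epsilon_p \quad \text{if absolute change}$$

3. The maximum number of iterations, or of steps in step-halving, is reached. 
4. Complete or quasi-complete separation of the data is established. 


Depending on the user’s choice, either relative or absolute change (default) is considered in criteria 1 
and 2. 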


If the hybrid algorithm converges with Fisher scoring step, the iterations continue with Newton- 
Raphson steps. 
Variance estimates 


Variances of parameter estimates are computed according to the Taylor linearization method as 
suggested by Binder (1983). Define vector $\mathbf{s}_i(\mathbf{B})$ of size $(K - 1 + p)$ to be the contribution of the 
ith element to the estimating equations as follows: 

$$
s_i^{(k)}(\mathbf{B}) = \frac{\partial\gamma_{i,k}}{\partial\eta_{i,k}}
\left(\frac{y_{i,k}}{\pi_{i,k}} - \frac{y_{i,k+1}}{\pi_{i,k+1}}\right), \quad k = 1, \ldots, K-1
$$

and 

$$
s_i^{(K-1+t)}(\mathbf{B}) = -\sum_{k=1}^{K}
\left(\frac{\partial\gamma_{i,k}}{\partial\eta_{i,k}} - \frac{\partial\gamma_{i,k-1}}{\partial\eta_{i,k-1}}\right)
\frac{y_{i,k}}{\pi_{i,k}}\, x_{it}, \quad t = 1, \ldots, p
$$

so that 

$$\hat{S}(\mathbf{B}) = \sum_{i=1}^{n} w_i\, \mathbf{s}_i(\mathbf{B})$$

The above sum is to be considered as an estimate for the population total of the vectors $\mathbf{s}_i(\mathbf{B})$. 
Its sample design-based covariance matrix is denoted by $\hat{V}(\hat{S}(\mathbf{B}))$. See “Complex Samples: 
Covariance Matrix of Total” for more information on its computation. Then the covariance 
matrix of $\hat{\mathbf{B}}$ is estimated by 

$$\hat{V}(\hat{\mathbf{B}}) = \hat{J}(\hat{\mathbf{B}})^{-}\,\hat{V}(\hat{S}(\hat{\mathbf{B}}))\,\hat{J}(\hat{\mathbf{B}})^{-}$$

where $\hat{J}(\hat{\mathbf{B}})^{-}$ is a generalized inverse of $\hat{J}(\hat{\mathbf{B}})$. 


Note: If any diagonal element of $\hat{V}(\hat{S}(\hat{\mathbf{B}}))$ happens to be non-positive due to the use of the 
Yates-Grundy-Sen estimator, all elements in the corresponding row and column are set to zero. 


Subpopulation estimates 


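When analyses are requested for a given subpopulation D, we redefine $(y_i, \mathbf{x}_i')$ as follows: 

$$
(y_i, \mathbf{x}_i') =
\begin{cases}
(y_i, \mathbf{x}_i') & \text{if the } i\text{th record is in } D \\
(\mathbf{0}, \mathbf{0}) & \text{otherwise}
\end{cases}
$$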


This is to ensure that the contribution to estimates of every element not in subpopulation D is 
zero. When computing point estimates, this substitution is equivalent to including only the 
subpopulation elements in the calculations. This is in contrast to computing the variance estimates 
where all elements in the sample need to be included. 
Standard Errors 


Let $\hat{B}_i$ denote a non-redundant parameter estimate. Its standard error is the square root of its 
estimated variance: 

$$SE(\hat{B}_i) = \sqrt{\hat{V}(\hat{B}_i)}$$

Standard error is undefined for redundant parameters. 


Degrees of Freedom 


The number of degrees of freedom df used for computing confidence intervals and test 
statistics below is calculated as the difference between the number of primary sampling units and 
the number of strata in the first stage of sampling. We shall also refer to this quantity as the sample 
design degrees of freedom. Alternatively, df may be specified by the user. 


Confidence Intervals 


A level $1-\alpha$ confidence interval is constructed for a given $0 < \alpha < 1$ for each non-redundant 
model parameter $B_i$. Confidence bounds are given by 

$$\hat{B}_i \pm SE(\hat{B}_i)\, t_{df}(1 - \alpha/2)$$

where $SE(\hat{B}_i)$ is the estimated standard error of $\hat{B}_i$, and $t_{df}(1 - \alpha/2)$ is the 
$100(1 - \alpha/2)$ percentile of the t distribution with df degrees of freedom. 


t Tests 


Testing hypothesis $H_{0i}: B_i = 0$ for each non-redundant model parameter $B_i$ is performed using 
the t test statistic: 

$$t(\hat{B}_i) = \frac{\hat{B}_i}{SE(\hat{B}_i)}$$

The p-value for the two-sided test is given by the probability $P\left(|T| > |t(\hat{B}_i)|\right)$, where T is a 
random variable from the t distribution with df degrees of freedom. 
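
The standard error, t statistic, and confidence bounds for a single parameter can be sketched as below. The function name is an assumption, and the t percentile is taken as an input because the Python standard library does not provide t distribution quantiles; the p-value would likewise need a t distribution function from a statistics library.

```python
import math

def wald_summary(estimate, variance, t_crit):
    """Standard error, t statistic, and confidence bounds for one parameter.

    estimate : parameter estimate
    variance : its estimated design-based variance
    t_crit   : t percentile at 1 - alpha/2 with df degrees of freedom (caller-supplied)
    """
    se = math.sqrt(variance)
    t_stat = estimate / se
    bounds = (estimate - t_crit * se, estimate + t_crit * se)
    return se, t_stat, bounds

# Hypothetical values.
se, t_stat, (lower, upper) = wald_summary(1.5, 0.25, 2.0)
```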


Design Effects 


The design effect $Deff(\hat{B}_i)$ for non-redundant parameter estimate $\hat{B}_i$ is given by 

$$Deff(\hat{B}_i) = \frac{\hat{V}(\hat{B}_i)}{\hat{V}_{srs}(\hat{B}_i)}$$

Design effects are undefined for redundant parameters. 


$\hat{V}(\hat{B}_i)$ is the estimate of variance of $\hat{B}_i$ under the appropriate sampling design, while $\hat{V}_{srs}(\hat{B}_i)$ is 
the estimate of variance of $\hat{B}_i$ under the simple random sampling assumption. The latter is 
computed as the ith diagonal element of the following matrix: 

$$\hat{V}_{srs}(\hat{\mathbf{B}}) = \hat{J}(\hat{\mathbf{B}})^{-}\,\hat{V}_{srs}(\hat{S}(\hat{\mathbf{B}}))\,\hat{J}(\hat{\mathbf{B}})^{-}$$

$\hat{V}_{srs}(\hat{S}(\hat{\mathbf{B}}))$ can be computed by the following formula: 

$$\hat{V}_{srs}(\hat{S}(\hat{\mathbf{B}})) = (fpc)\cdot\frac{\hat{N}}{n-1}\sum_{j=1}^{n} w_j\,\mathbf{s}_j\mathbf{s}_j^T$$

with $\mathbf{s}_j$ as specified earlier and $\hat{N}$ being an estimate of the population size. 


Assuming sampling without replacement we have $fpc = 1 - n/\hat{N}$ given that $n/\hat{N} < 1$, while for 
sampling with replacement we set $fpc = 1$. This assumption is independent of the sampling method 
specified for the complex sample design-based variance $\hat{V}(\hat{S}(\hat{\mathbf{B}}))$. 


For subpopulation analysis we have that $\mathbf{s}_j = \mathbf{0}$ whenever the record does not belong to a given 
subpopulation. 


We also provide the square root of design effects. Note that the square root of the design effect Deff, 
computed without finite population correction, has been commonly denoted by Deft following 
the paper by Kish (1995). Design effects and their application have been discussed since their introduction 
by Kish (1965). 


Linear combinations 


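Given a constant vector $\mathbf{l}$ of the same size as the vector of parameter estimates $\hat{\mathbf{B}}$, we compute 
variance estimates for the linear combination $\mathbf{l}^T\hat{\mathbf{B}}$ by the formulas: 

$$\hat{V}(\mathbf{l}^T\hat{\mathbf{B}}) = \mathbf{l}^T\,\hat{V}(\hat{\mathbf{B}})\,\mathbf{l}$$

and 

$$\hat{V}_{srs}(\mathbf{l}^T\hat{\mathbf{B}}) = \mathbf{l}^T\,\hat{V}_{srs}(\hat{\mathbf{B}})\,\mathbf{l}$$

The design effect $Deff(\mathbf{l}^T\hat{\mathbf{B}})$ for the linear combination $\mathbf{l}^T\hat{\mathbf{B}}$ is then given by 

$$Deff(\mathbf{l}^T\hat{\mathbf{B}}) = \frac{\hat{V}(\mathbf{l}^T\hat{\mathbf{B}})}{\hat{V}_{srs}(\mathbf{l}^T\hat{\mathbf{B}})}$$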


Pseudo -2 Log Likelihood 


For the model under consideration, the sample-based estimate of the population −2 Log Likelihood 
is 

$$-2\hat{\ell}(\hat{\mathbf{B}})$$

For the initial model, the estimate of the −2 Log Likelihood is 

$$-2\hat{\ell}(\mathbf{B}^{(0)})$$

where $\mathbf{B}^{(0)}$ denotes the initial parameter values used in the iterative estimating procedure. 


Pseudo R Squares 


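Let $L(\mathbf{B})$ be the likelihood function for the whole population; that is, $L(\mathbf{B}) = \exp(\ell(\mathbf{B}))$. A 
sample-based estimate of it is $\hat{L}(\mathbf{B}) = \exp(\hat{\ell}(\mathbf{B}))$. 


Cox and Snell’s R Square 

$$
R^2_{CS} = 1 - \left(\frac{\hat{L}(\mathbf{B}^{(0)})}{\hat{L}(\hat{\mathbf{B}})}\right)^{2/\hat{N}}
= 1 - \exp\left\{-\frac{1}{\hat{N}}\left[-2\hat{\ell}(\mathbf{B}^{(0)}) - \left(-2\hat{\ell}(\hat{\mathbf{B}})\right)\right]\right\}
$$


Nagelkerke’s R Square 

$$R^2_{N} = \frac{R^2_{CS}}{1 - \left\{\hat{L}(\mathbf{B}^{(0)})\right\}^{2/\hat{N}}}$$


McFadden’s R Square 

$$R^2_{M} = 1 - \frac{\hat{\ell}(\hat{\mathbf{B}})}{\hat{\ell}(\mathbf{B}^{(0)})}$$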
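
As a small numerical illustration, the three pseudo R-square measures can be computed from the two pseudo log-likelihood values and the estimated population size. The function and argument names are assumptions made here; working on the log scale avoids forming the likelihoods themselves, which would underflow for realistic data.

```python
import math

def pseudo_r_squares(ll_initial, ll_final, n_hat):
    """Cox and Snell, Nagelkerke, and McFadden pseudo R-squares.

    ll_initial : pseudo log-likelihood at the initial parameter values
    ll_final   : pseudo log-likelihood at the final estimates
    n_hat      : estimated population size (sum of weights)
    """
    cox_snell = 1.0 - math.exp(2.0 * (ll_initial - ll_final) / n_hat)
    nagelkerke = cox_snell / (1.0 - math.exp(2.0 * ll_initial / n_hat))
    mcfadden = 1.0 - ll_final / ll_initial
    return cox_snell, nagelkerke, mcfadden

# Hypothetical values.
r2_cs, r2_n, r2_m = pseudo_r_squares(-100.0, -80.0, 100.0)
```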


Hypothesis Testing 


Contrasts defined as linear combinations of threshold and regression parameters can be tested. 
Given matrix L with r rows and $K - 1 + p$ columns, and vector K with r elements, Complex 
Samples Ordinal Regression performs testing of the linear hypothesis $H_0: \mathbf{L}\mathbf{B} = \mathbf{K}$. See “Complex 
Samples: Model Testing” for details. 


Custom tests 


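For a user-specified L and K, $H_0: \mathbf{L}\mathbf{B} = \mathbf{K}$ is tested only when it is testable; that is, when vector 
LB is estimable. Consider the partition $\mathbf{L} = (\mathbf{L}(\theta), \mathbf{L}(\beta))$, where $\mathbf{L}(\theta) = (\mathbf{l}_1, \ldots, \mathbf{l}_{K-1})$ consists of 
columns corresponding to threshold parameters and $\mathbf{L}(\beta)$ is the part of L corresponding to 
regression parameters. Consider the matrix $\mathbf{L}_0 = (\mathbf{l}_0, \mathbf{L}(\beta))$, where the column vectors corresponding 
to threshold parameters are replaced by their sum $\mathbf{l}_0 = \sum_{k=1}^{K-1}\mathbf{l}_k$. Then LB is estimable if and only 
if $\mathbf{L}_0 = \mathbf{L}_0\mathbf{H}$, where $\mathbf{H} = (\mathbf{X}_1^T\mathbf{X}_1)^{-}\mathbf{X}_1^T\mathbf{X}_1$ is a $(p+1) \times (p+1)$ matrix constructed using 
$\mathbf{X}_1 = (\mathbf{1}, -\mathbf{X})$. 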
Default tests of Model effects 


For each effect specified in the model excluding intercept, a type III test matrix L is constructed and 
$H_0: \mathbf{L}\mathbf{B} = \mathbf{0}$ is tested. Construction of matrix L is based on matrix $\mathbf{H} = (\mathbf{X}_1^T\mathbf{X}_1)^{-}\mathbf{X}_1^T\mathbf{X}_1$ and 
is such that LB is estimable. It involves parameters only for the given effect and the effects 
containing the given effect. It does not depend on the order of effects specified in the model. If 
such a matrix cannot be constructed, the effect is not testable. 


See “Type III Sum of Squares and Hypothesis Matrix” in Sums of Squares for computational 
details on construction of type III test matrices. 


Test of Parallel Lines Assumption 


Consider an alternative model for the specified cumulative link model by allowing different 
regression parameters $\beta^{(k)} = (\beta_1^{(k)}, \ldots, \beta_p^{(k)})^T$ for the first $K-1$ response categories: 

$$\mathrm{link}\left(P(Y \le k \mid \mathbf{x})\right) = \theta_k - \beta^{(k)T}\mathbf{x}, \quad k = 1, \ldots, K-1$$

The alternative model then contains $(K-1)(p+1)$ parameters $\mathbf{B}_A$, with $K-1$ threshold parameters and 
$(K-1)p$ regression parameters. The cumulative link model is a restriction of the alternative model based on the 
assumption of parallel lines, corresponding to the following null hypothesis: 

$$H_0: \beta^{(1)} = \cdots = \beta^{(K-1)}$$

We conduct a test of this hypothesis by estimating the parameters of the alternative model and 
applying a Wald type test for $\mathbf{L}\mathbf{B}_A = \mathbf{0}$ with the contrast matrix L given by 

$$
\mathbf{L} =
\begin{pmatrix}
\mathbf{L}^{(1)} \\
\vdots \\
\mathbf{L}^{(p)}
\end{pmatrix}
$$

where each $\mathbf{L}^{(t)}$, $t = 1, \ldots, p$, is a $(K-2) \times (p+1)(K-1)$ matrix containing pairwise contrasts 
for parameter t between the first and the rest of the regression equations for corresponding 
responses. 


See “Complex Samples: Model Testing” for details of conducting an appropriate Wald test. There 
are several testing options available, but they all require previously computed alternative model 


parameter estimates $\hat{\mathbf{B}}_A$ as well as their covariance matrix $\hat{V}(\hat{\mathbf{B}}_A)$. For some of the options, 
covariance matrix $\hat{V}_{srs}(\hat{\mathbf{B}}_A)$ must be computed as well. 


See Peterson and Harrell (1990) for a discussion of the alternative model. 


Estimation of the Alternative Model 


The algorithm applied for computing the solution of the alternative model, $\hat{\mathbf{B}}_A$, is similar to the algorithm 
for the restricted cumulative link model solution $\hat{\mathbf{B}}$. The main difference is in the computation of the estimating 
equations and second derivatives appropriate for the alternative model. They are outlined below. 
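Expressions $\partial\hat{\ell}_A/\partial\theta_k$, $\partial^2\hat{\ell}_A/\partial\theta_j\partial\theta_k$ and the expected $\partial^2\hat{\ell}_A/\partial\theta_j\partial\theta_k$ for $j, k = 1, \ldots, K-1$ 
are identical to their restricted model counterparts. 


Estimating equations for alternative model regression parameters are given by 

$$
\frac{\partial\hat{\ell}_A}{\partial\beta_t^{(k)}}
= -\sum_{i=1}^{n} w_i\frac{\partial\gamma_{i,k}}{\partial\eta_{i,k}}
\left(\frac{y_{i,k}}{\pi_{i,k}} - \frac{y_{i,k+1}}{\pi_{i,k+1}}\right)x_{it} = 0,
\quad k = 1, \ldots, K-1, \quad t = 1, \ldots, p
$$

Derivatives of the estimated scores for the alternative model are given by: 

$$
\frac{\partial^2\hat{\ell}_A}{\partial\theta_j\partial\beta_t^{(k)}} = -\sum_{i=1}^{n} f_{i,j,k}\,x_{it}
\quad \text{and} \quad
\frac{\partial^2\hat{\ell}_A}{\partial\beta_t^{(j)}\partial\beta_u^{(k)}} = \sum_{i=1}^{n} f_{i,j,k}\,x_{it}\,x_{iu},
\quad j, k = 1, \ldots, K-1, \quad t, u = 1, \ldots, p
$$

where, with $\gamma'_{i,k} = \partial\gamma_{i,k}/\partial\eta_{i,k}$, $\gamma''_{i,k} = \partial^2\gamma_{i,k}/\partial\eta_{i,k}^2$ and $\delta_{j,k} = 1$ if $j = k$ and 0 otherwise, 

$$
f_{i,j,k} = w_i\left[\delta_{j,k}\,\gamma''_{i,k}\left(\frac{y_{i,k}}{\pi_{i,k}} - \frac{y_{i,k+1}}{\pi_{i,k+1}}\right)
- \gamma'_{i,j}\gamma'_{i,k}\left((\delta_{j,k} - \delta_{j,k-1})\frac{y_{i,k}}{\pi_{i,k}^2}
- (\delta_{j,k+1} - \delta_{j,k})\frac{y_{i,k+1}}{\pi_{i,k+1}^2}\right)\right]
$$


Expected derivatives of the estimated scores for the alternative model are given by the following 
expressions: 

$$
E\left[\frac{\partial^2\hat{\ell}_A}{\partial\theta_j\partial\beta_t^{(k)}}\right] = \sum_{i=1}^{n} e_{i,j,k}\,x_{it}
\quad \text{and} \quad
E\left[\frac{\partial^2\hat{\ell}_A}{\partial\beta_t^{(j)}\partial\beta_u^{(k)}}\right] = -\sum_{i=1}^{n} e_{i,j,k}\,x_{it}\,x_{iu},
\quad j, k = 1, \ldots, K-1, \quad t, u = 1, \ldots, p
$$

where 

$$
e_{i,j,k} = w_i\left[(\delta_{j,k} - \delta_{j,k-1})\frac{1}{\pi_{i,k}}
- (\delta_{j,k+1} - \delta_{j,k})\frac{1}{\pi_{i,k+1}}\right]\gamma'_{i,j}\gamma'_{i,k}
$$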


Initial values for threshold and regression parameters in the alternative model are taken as the final 
estimated parameters in the restricted model. 


The solution of the alternative model, $\hat{\mathbf{B}}_A$, is provided as an optional output. 


Predicted Values 


For a predictor design vector $\mathbf{x}_i$ and estimated parameters $(\hat{\theta}, \hat{\beta})$, the predicted probability for 
each response category is denoted by $\hat{\pi}_{i,k}$, $k = 1, \ldots, K$. The predicted category $c(\mathbf{x}_i)$ is the 
one with the highest predicted probability; that is, 

$$c(\mathbf{x}_i) = \arg\max_{k}\hat{\pi}_{i,k}$$

If there is a tie in determining $c(\mathbf{x}_i)$, choose the category with 


1. higher $\hat{N}_k = \sum_{i=1}^{n} w_i\,y_{i,k}$. 


2. If there is still a tie, choose the one with higher $n_k = \sum_{i=1}^{n} y_{i,k}$. 


3. If there is still a tie, choose the one with lower category number. 


Classification table 


A two-way classification table is constructed with (k,l)th element, k,l = 1,...,K, being the sum of 
weights $w_i$ for the sample elements i whose actual response category is k and predicted response 
category is l respectively. 
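
A weighted classification table of this kind amounts to a simple accumulation. The following sketch assumes categories coded 1 to K and uses an illustrative function name; rows index the actual category and columns the predicted category.

```python
def classification_table(actual, predicted, weights, K):
    """K x K table whose (k, l) cell sums the weights of elements with
    actual category k and predicted category l (both coded 1..K)."""
    table = [[0.0] * K for _ in range(K)]
    for a, p, w in zip(actual, predicted, weights):
        table[a - 1][p - 1] += w
    return table

# Hypothetical data: four sample elements, K = 2.
table = classification_table([1, 1, 2, 2], [1, 2, 2, 2], [1.5, 0.5, 2.0, 1.0], 2)
```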


Predictions for new or incomplete records 


Predicted probabilities and category are also computed for the records not used in the analysis, 
but having non-missing values for all the model predictors and subpopulation variable if any. An 
additional requirement is that given predictor values could be properly parametrized by using 
only the existing model parameters. 


Cumulative odds ratios 


Given user-specified design vectors $\mathbf{x}_1$ and $\mathbf{x}_2$, the ratio of cumulative odds at $\mathbf{x}_1$ to cumulative 
odds at $\mathbf{x}_2$ is computed for the cumulative logistic link. For response category $k = 1, \ldots, K-1$, 

$$
or(\mathbf{x}_1, \mathbf{x}_2)
= \frac{P(Y \le k \mid \mathbf{x}_1)/P(Y > k \mid \mathbf{x}_1)}{P(Y \le k \mid \mathbf{x}_2)/P(Y > k \mid \mathbf{x}_2)}
= \exp\left(-\beta^T(\mathbf{x}_1 - \mathbf{x}_2)\right)
$$

Notice that the cumulative odds ratio for this particular link does not depend on the response category k. 
Because of this property, the ordinal response model with cumulative logistic link is also called a 
proportional odds model. 

A level $1-\alpha$ confidence interval for $or(\mathbf{x}_1, \mathbf{x}_2)$ is given by 

$$\exp\left[c \pm SE(c)\,t_{df}(1 - \alpha/2)\right]$$

where $c = -\hat{\beta}^T(\mathbf{x}_1 - \mathbf{x}_2)$ and 

$$SE(c) = \sqrt{(\mathbf{x}_1 - \mathbf{x}_2)^T\,\hat{V}(\hat{\beta})\,(\mathbf{x}_1 - \mathbf{x}_2)}$$

Given a factor, we compute odds ratios for all its categories relative to the reference category. If a 
covariate is specified, we compute odds ratios for its unit change. Other factors are held fixed at 
their respective reference categories, while other covariates are held fixed at their mean values, 
unless requested differently by the user. 
CSSELECT Algorithms 


This document describes the algorithm used by CSSELECT to draw samples according to 
complex designs. The data file does not have to be sorted. Population units can appear more than 
once in the data file and they do not have to be in a consecutive block of cases. 


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


$N$      Population size 
$n$      Sample size 
$f$      Sampling fraction 
$h_i$    Hit count of ith population unit (i=1,...,N). 
$M_i$    Size measure of ith population unit (i=1,...,N). 
$M$      Total size, $M = \sum_{i=1}^{N} M_i$. 
$p_i$    $p_i = M_i/M$ is the relative size of ith population unit (i=1,...,N). 


Stratification 


Stratification partitions the sampling frame into disjoint sets. Sampling is carried out 
independently within each stratum. Therefore, without loss of generality, the algorithm described 
in this document only considers sampling from one population. 


In the first stage of selection, the sampling frame is partitioned by the stratification variables 
specified in stage 1. In the second stage, the sampling frame is stratified by first-stage strata and 
cluster variables as well as strata variables specified in stage 2. If sampling with replacement is 
used in the first stage, the first-stage duplication index is also one of the stratification variables. 
Stratification of the third stage continues in a like manner. 


Population Size 


Sampling units in a population are identified by all unique level combinations of cluster variables 
within a stratum. Therefore, the population size N of a stratum is equal to the number of unique 
level combinations of the cluster variables within a stratum. When a sampling unit is selected, all 
cases having the same sampling unit identifier are included in the sample. If no cluster variable is 
defined, each case is a sampling unit. 


Sample Size 


CSSELECT uses a fixed sample size approach in selecting samples. If the sample size is supplied 
by the user, it should satisfy 0 < n < N for any without replacement design and n > 0 for any 
with replacement design. 
If a sampling fraction f is specified, it should satisfy 0 < f < 1 for any without replacement 
design and f > 0 for any with replacement design. The actual sample size is determined by the 
formula n = round(f * N). When the option RATEMINSIZE is specified, a sample size less than 
RATEMINSIZE is raised to RATEMINSIZE. Likewise, a sample size exceeding RATEMAXSIZE 
is lowered to RATEMAXSIZE. 
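
The sample size rule can be sketched as a round followed by a clamp. The function and parameter names below are illustrative stand-ins for the RATEMINSIZE and RATEMAXSIZE options, and rounding halves upward is an assumption: the text says only "round", and Python's built-in `round` uses a different tie rule.

```python
import math

def sample_size(f, N, rate_min_size=None, rate_max_size=None):
    """Sample size from a sampling fraction, with optional lower and upper limits."""
    n = math.floor(f * N + 0.5)          # assumed tie rule: round half up
    if rate_min_size is not None and n < rate_min_size:
        n = rate_min_size
    if rate_max_size is not None and n > rate_max_size:
        n = rate_max_size
    return n

# Hypothetical calls.
plain = sample_size(0.25, 10)
raised = sample_size(0.01, 50, rate_min_size=5)
lowered = sample_size(0.9, 100, rate_max_size=20)
```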


Simple Random Sampling 


This algorithm selects n distinct units out of N population units with equal probability; see Fan, 
Muller & Rezucha (1962) for more information. 


■ Inclusion probability of ith unit = n/N 
■ Sampling weight of ith unit = N/n 


Algorithm 
1. If f is supplied, compute n=round(f*N). 
2. Set k=0, i=0 and start data scan. 
3. Get a population unit and set k=k+1. If no more population units are left, terminate. 
4. Test if kth unit should go into the sample. 


Generate a uniform (0,1) random number U. 
If (n − i)/(N − k + 1) > U, kth population unit is selected and set i=i+1. 


If i=n, terminate. Otherwise, go to step 3. 
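
The sequential scan above can be sketched in Python as follows. The function name and the use of the standard `random` module as the uniform generator are assumptions made for illustration; the procedure's own random number generator is not described here.

```python
import random

def simple_random_sample(N, n, rng=random):
    """Select n distinct units from 1..N in a single pass over the population."""
    selected = []
    i = 0                                  # number of units selected so far
    for k in range(1, N + 1):              # k is the current population unit
        U = rng.random()                   # uniform on [0, 1)
        if (n - i) / (N - k + 1) > U:
            selected.append(k)
            i += 1
            if i == n:
                break
    return selected

random.seed(0)
sample = simple_random_sample(10, 4)
```

Because the selection probability reaches 1 whenever the units still needed equal the units still unscanned, the scan always returns exactly n units.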


Unrestricted Random Sampling 


This algorithm selects n units out of N population units with equal probability and with 
replacement. 


■ Inclusion probability of ith unit = $1 - (1 - 1/N)^n$ 
■ Sampling weight of ith unit = N/n. (For use with the Hansen-Hurwitz (1943) estimator) 
■ Expected number of hits of ith unit = n/N 


Algorithm 
1. Set i=0 and initialize all hit counts to zero. 
2. Generate an integer k between 1 and N uniformly. 
3. Increase hit count of kth population unit by 1. 
4. Set i=i+1. 
5. If i=n, then terminate. Otherwise go to step 2. 


At the end of the procedure, population units with hit count greater than zero are selected. 
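
A short sketch of sampling with replacement through hit counts is given below. The function names are illustrative assumptions, and Python's `random.randint` stands in for the uniform integer generator.

```python
import random

def unrestricted_random_sample(N, n, rng=random):
    """Draw n times with replacement from units 1..N and return the hit counts."""
    hits = [0] * N
    for _ in range(n):
        k = rng.randint(1, N)      # uniform integer in 1..N, both ends included
        hits[k - 1] += 1
    return hits

def inclusion_probability(N, n):
    """Probability that a given unit receives at least one hit."""
    return 1.0 - (1.0 - 1.0 / N) ** n

random.seed(0)
hits = unrestricted_random_sample(10, 6)
selected_units = [k + 1 for k, h in enumerate(hits) if h > 0]
```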


Systematic Sampling 


This algorithm selects n distinct units out of N population units. If the selection interval (N/n) 
is not an integer, an exact fractional selection interval is used. 


■ Inclusion probability of a unit = n/N 
■ Sampling weight = N/n 


Algorithm 
1. Draw a uniform (0,1) random number U. 
2. Population units with indices {i: i=trunc((U+k)*N/n)+1, k=0,...,n-1} are included in the sample. 
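
The index formula in step 2 translates directly into code. In this sketch the function name is an assumption, and the `min` guard against floating-point rounding at the upper end is an addition made here rather than part of the described algorithm.

```python
import math
import random

def systematic_sample(N, n, rng=random):
    """Indices trunc((U + k) * N / n) + 1 for k = 0..n-1, using one random start U."""
    U = rng.random()                                   # uniform on [0, 1)
    return [min(math.trunc((U + k) * N / n) + 1, N)    # min() guards against rounding
            for k in range(n)]

random.seed(1)
sample = systematic_sample(10, 4)
```

With N/n = 2.5 in this example, consecutive selected indices differ by 2 or 3, which is how the fractional selection interval shows up in practice.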


Sequential Sampling (Chromy) 


See the section on PPS sequential sampling. This algorithm is a special case of PPS Chromy with 
all size measures $M_i$ equal. 


PPS Sampling without Replacement (Hanurav & Vijayan) 


This algorithm selects n distinct units out of N population units with probability proportional to 
size without replacement. This method was first proposed by Hanurav (1967) and extended by 
Vijayan (1968) to the case of n>2. 


■ Inclusion probability of ith unit = $n p_i$ 
■ Sampling weight of ith unit = $1/(n p_i)$ 
■ Special requirement: $\max_i M_i < M/n$ 


Algorithm (Case 1) 


This algorithm assumes that the population units are sorted by $M_i$; that is, 
$M_1 \le M_2 \le \cdots \le M_N$, with the additional assumption that $M_{N-n+1} \ne M_N$. 

1. Compute the probabilities $\theta_i = \dfrac{n\,(p_{N-n+i+1} - p_{N-n+i})\,(S + i\,p_{N-n+1})}{S}$, $i = 1, \ldots, n$, where $S = \sum_{k=1}^{N-n} p_k$ and $p_{N+1} = 1/n$. 
2. Select one integer from 1,...,n with probability proportional to $\theta_i$. 
3. If the integer selected is i, then the last (n−i) population units are selected. 
4. Define a new set of probabilities for the first (N−n+i) population units: 

$$
p^*_j =
\begin{cases}
\dfrac{p_j}{S + i\,p_{N-n+1}} & 1 \le j \le N-n+1 \\[2ex]
\dfrac{p_{N-n+1}}{S + i\,p_{N-n+1}} & N-n+1 < j \le N-n+i
\end{cases}
$$

5. Define $P_j = \dfrac{p^*_j}{p^*_{j+1} + \cdots + p^*_{N-n+i}}$, $j = 1, \ldots, N-n+i-1$. 
6. Set m=1 and select one unit from the first (N−n+1) population units with probability proportional to 

$$a_1 = i\,p^*_1$$
$$a_j = i\,p^*_j\prod_{k=1}^{j-1}\left[1 - (i-1)P_k\right], \quad j = 2, \ldots, N-n+1$$

7. Denote the index of the selected unit by $j_m$. 
8. Set m=m+1 and select one unit from the $(j_{m-1}+1)$th to (N−n+m)th population units with 
the following revised probabilities: 

$$a_{j_{m-1}+1} = (i-m+1)\,p^*_{j_{m-1}+1}$$
$$a_j = (i-m+1)\,p^*_j\prod_{k=j_{m-1}+1}^{j-1}\left[1 - (i-m)P_k\right], \quad j = j_{m-1}+2, \ldots, N-n+m$$

9. Denote the selected unit in step 8 by $j_m$. 
10. If m=i, terminate. Otherwise, go to step 8. 

At the end of the algorithm, the last (n−i) units and 
units with indices $j_1, \ldots, j_i$ are selected. 


Joint Inclusion Probabilities (Case 1) 


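The joint inclusion probability of unit i and unit j in the population ($1 \le i < j \le N$) is given by

$$\pi_{ij} = \sum_{r=1}^{n} \theta_r K_{ij}^{(r)}$$

where

$$K_{ij}^{(r)} = \begin{cases} 1, & N-n+r < i \le N-1 \\ \dfrac{r\,p_{N-n+1}}{S + r\,p_{N-n+1}}, & N-n < i \le N-n+r \text{ and } j > N-n+r \\ \dfrac{r\,p_i}{S + r\,p_{N-n+1}}, & 0 < i \le N-n \text{ and } j > N-n+r \\ \pi_{ij}^{(r)}, & j \le N-n+r \end{cases}$$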


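The $\pi_{ij}^{(r)}$ are the conditional joint inclusion probabilities given that the last (n-r) units are selected at
step 3. They can be computed by the following formula:

$$\pi_{ij}^{(r)} = r(r-1)\,(1 - P_1^{(r)}) \cdots (1 - P_{i-1}^{(r)})\,P_i^{(r)}\,p_j^{(r)}$$

where

$$p_k^{(r)} = \begin{cases} \dfrac{p_k}{S + r\,p_{N-n+1}}, & k \le N-n+1 \\ \dfrac{p_{N-n+1}}{S + r\,p_{N-n+1}}, & N-n+1 < k \le N-n+r \end{cases}$$

and

$$P_k^{(r)} = \frac{p_k^{(r)}}{p_{k+1}^{(r)} + \dots + p_{N-n+r}^{(r)}}$$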


Note: There is a typo in (3.5) of Vijayan (1968) and (3.3) of Fox (1989). The factor (1/2) should
not be there. See also Golmant (1990) and Watts (1991) for other corrections.
Algorithm (Case 2)


This algorithm assumes that the population units are sorted by $M_i$ with the order
$M_1 \le M_2 \le \dots \le M_N$ and the additional assumption $M_{N-n+1} = M_N$.

1. Define the probabilities

$$P_i = \frac{M_i}{M_{i+1} + \dots + M_N}, \quad i = 1,\dots,N-1$$

2. Select one unit from the first (N-n+1) population units with probability proportional to

$$a_1 = n p_1, \qquad a_j = n p_j \prod_{k=1}^{j-1} \left[1 - (n-1) P_k\right], \quad j = 2,\dots,N-n+1$$

3. Set m=1 and denote the index of the selected unit by $j_m$.

4. Set m=m+1.

5. Select one unit from the $(j_{m-1}+1)$th to the $(N-n+m)$th population unit with probability
proportional to

$$a_{j_{m-1}+1} = (n-m+1)\,p_{j_{m-1}+1}$$

$$a_j = (n-m+1)\,p_j \prod_{k=j_{m-1}+1}^{j-1} \left[1 - (n-m) P_k\right], \quad j = j_{m-1}+2,\dots,N-n+m$$

6. Denote the index of the unit selected in step 5 by $j_m$.

7. If m=n, terminate. Otherwise, go to step 4.

At the end of the algorithm, population units with indices $j_1,\dots,j_n$ are selected.


Joint Inclusion Probabilities (Case 2) 


Joint inclusion probabilities $\pi_{ij}$ of unit i and unit j in the population ($1 \le i < j \le N$) are given by

$$\pi_{ij} = n(n-1)\,(1 - P_1) \cdots (1 - P_{i-1})\,P_i\,p_j$$


PPS Sampling with Replacement 


This algorithm selects n units out of N population units with probability proportional to size and 
with replacement. Any units may be sampled more than once. 


■ Inclusion probability of ith unit = $1 - (1 - p_i)^n$
■ Sampling weight of ith unit = $1/(n p_i)$ (for use with the Hansen-Hurwitz (1943) estimator)
■ Expected number of hits of ith unit = $n p_i$

Algorithm

1. Compute total size $M = \sum_{i=1}^{N} M_i$.

2. Generate n uniform(0,M) random numbers $U_1,\dots,U_n$.

3. Compute hit counts of ith population unit $h_i = \mathrm{card}\{U_j : M^{*}_{i-1} < U_j \le M^{*}_i,\ j = 1,\dots,n\}$, where
card{} is the number of elements in the set, $M^{*}_0 = 0$, and $M^{*}_i = \sum_{k=1}^{i} M_k$.

At the end of the algorithm, population units with hit count $h_i > 0$ are selected.
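
The three steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `pps_with_replacement`, the `rng` argument, and the half-open interval convention (a draw U counts for unit i when M*_{i-1} < U <= M*_i) are choices made here, not part of the procedure's documented implementation.

```python
import bisect
import itertools
import random


def pps_with_replacement(sizes, n, rng=None):
    """Return hit counts h_i for n PPS draws with replacement (hypothetical helper)."""
    rng = rng or random.Random()
    cum = list(itertools.accumulate(sizes))  # cumulated sizes M*_1, ..., M*_N
    total = cum[-1]                           # total size M
    hits = [0] * len(sizes)
    for _ in range(n):
        u = rng.uniform(0, total)
        # First index i with M*_i >= u, i.e. M*_{i-1} < u <= M*_i.
        i = min(bisect.bisect_left(cum, u), len(sizes) - 1)
        hits[i] += 1
    return hits
```

Units with a positive hit count form the sample; a unit with hit count h appears h times.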


PPS Systematic Sampling 


This algorithm selects n units out of N population units with probability proportional to size. If the
size of the ith unit $M_i$ is greater than the selection interval, the ith unit is sampled more than once.

■ Inclusion probability of ith unit = $n p_i$
■ Sampling weight of ith unit = $1/(n p_i)$
■ Expected number of hits of ith unit = $n p_i$. In order to ensure no duplicates in the sample, the
condition $\max M_i < M/n$ is required.

Algorithm

1. Compute cumulated sizes $M^{*}_i = \sum_{k=1}^{i} M_k$.
2. Compute the selection interval I = M/n.
3. Generate a random number S from uniform(0,I).
4. Generate the sequence $\{S_j : S_j = S + (j-1)I,\ j = 1,\dots,n\}$.
5. Compute hit counts of ith population unit $h_i = \mathrm{card}\{S_j : M^{*}_{i-1} < S_j \le M^{*}_i,\ j = 1,\dots,n\}$, i=1,...,N,
where card{} is the number of elements in the set.

At the end of the algorithm, population units with hit counts $h_i > 0$ are selected.
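
A minimal Python sketch of the systematic selection steps is given below. The function name `pps_systematic` and the `rng` argument are hypothetical, and the sketch assumes the interval convention M*_{i-1} < S_j <= M*_i.

```python
import itertools
import random


def pps_systematic(sizes, n, rng=None):
    """Return hit counts h_i for PPS systematic selection (hypothetical helper)."""
    rng = rng or random.Random()
    cum = list(itertools.accumulate(sizes))  # cumulated sizes M*_i
    interval = cum[-1] / n                    # selection interval I = M/n
    start = rng.uniform(0, interval)          # random start S
    hits = [0] * len(sizes)
    i = 0
    for j in range(n):
        s = start + j * interval              # S_j = S + (j-1)I, with j counted from 1
        # Advance to the first unit whose cumulated size reaches s.
        while i < len(sizes) - 1 and cum[i] < s:
            i += 1
        hits[i] += 1
    return hits
```

When every size is below the selection interval, no unit can be hit twice; a unit larger than the interval is hit at least once and possibly several times.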


PPS Sequential Sampling (Chromy) 


This algorithm selects n units from N population units sequentially proportional to size with
minimum replacement. This method was proposed by Chromy (1979).

■ Inclusion probability of ith unit = $n p_i$
■ Sampling weight of ith unit = $1/(n p_i)$
■ Maximum number of hits of ith unit = $\mathrm{trunc}(n p_i) + 1$

Applying the restriction $\max M_i < M/n$ ensures that the maximum number of hits is equal to 1.


Algorithm 


1. Select one unit from the population proportional to its size $M_i$. The selected unit receives a label
1. Then assign labels sequentially to the remaining units. If the end of the list is encountered,
loop back to the beginning of the list until all N units are labeled. These labels are the index
i in the subsequent steps.
2. Compute the integer part of expected hit counts $I_i = \mathrm{trunc}(M^{*}_i)$, where $M^{*}_i = \sum_{k=1}^{i} n p_k$,
i=1,...,N.
3. Compute the fractional part of expected hit counts $F_i = M^{*}_i - I_i$, i=1,...,N.
4. Define $I_0 = 0$, $F_0 = 0$ and $T_0 = 0$.
5. Set i=1.
6. If $T_{i-1} = I_{i-1}$, go to step 8.
7. If $T_{i-1} = I_{i-1} + 1$, go to step 9.
8. Determine accumulated hits at ith step (case 1).
   Set $T_i = I_i$.
   If $F_i > F_{i-1}$, set $T_i = T_i + 1$ with probability $(F_i - F_{i-1}) / (1 - F_{i-1})$.
   Set i=i+1.
   If i > N, terminate. Otherwise go to step 6.
9. Determine accumulated hits at ith step (case 2).
   Set $T_i = I_i$.
   If $F_i > F_{i-1}$, set $T_i = T_i + 1$.
   If $F_{i-1} > F_i$, set $T_i = T_i + 1$ with probability $F_i / F_{i-1}$.
   Set i=i+1.
   If i > N, terminate. Otherwise go to step 6.

At the end of the algorithm, the number of hits of each unit can be computed by the formula
$h_i = T_i - T_{i-1}$, i=1,...,N. Units with $h_i > 0$ are selected.
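
The sequential steps can be sketched in Python as shown below. Several things here are assumptions of the sketch rather than documented behavior: the name `chromy_sequential`, the use of integer size measures with exact `Fraction` arithmetic (to avoid rounding problems in the integer and fractional parts), and treating the tie F_i = F_{i-1} in case 2 as a certain increment, which agrees with the probability F_i/F_{i-1} = 1.

```python
import random
from fractions import Fraction
from math import floor


def chromy_sequential(sizes, n, rng=None):
    """Hit counts for sequential PPS selection; integer sizes assumed (hypothetical helper)."""
    rng = rng or random.Random()
    N, total = len(sizes), sum(sizes)
    # Step 1: random start proportional to size, then relabel circularly.
    start = rng.choices(range(N), weights=sizes, k=1)[0]
    order = [(start + k) % N for k in range(N)]
    hits = [0] * N
    cum = 0
    t_prev, i_prev, f_prev = 0, 0, Fraction(0)   # T_0, I_0, F_0
    for idx in order:
        cum += sizes[idx]
        expected = Fraction(n * cum, total)       # cumulated expected hits
        i_cur = floor(expected)                   # integer part I_i
        f_cur = expected - i_cur                  # fractional part F_i
        t_cur = i_cur
        if t_prev == i_prev:                      # step 8 (case 1)
            if f_cur > f_prev and rng.random() < (f_cur - f_prev) / (1 - f_prev):
                t_cur += 1
        else:                                     # step 9 (case 2)
            if f_cur >= f_prev or rng.random() < f_cur / f_prev:
                t_cur += 1
        hits[idx] = t_cur - t_prev                # h_i = T_i - T_{i-1}
        t_prev, i_prev, f_prev = t_cur, i_cur, f_cur
    return hits
```

Because the cumulated expected hits end at exactly n, the hit counts always sum to n.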


PPS Sampford’s Method 
Sampford’s (1967) method selects n units out of N population units without replacement and 
probabilities proportional to size. 
■ Inclusion probability of ith unit = $n p_i$
■ Sampling weight of ith unit = $1/(n p_i)$
■ Special requirement: $\max M_i < M/n$


Algorithm 
1. If $\max M_i < M/n$, then go to step 2; otherwise go to step 5.

2. Select one unit with probability proportional to $p_i$, i=1,...,N.

3. Select the remaining (n-1) units with probabilities proportional to $p_i / (1 - n p_i)$, i=1,...,N.

4. If there are duplicates, reject the sample and go to step 2. Otherwise accept the selected units
and stop.


5. If $\max M_i = M/n$ and the $M_i$'s are constant, then select all units in the population and set all sampling
weights, 1st and 2nd order inclusion probabilities to 1.
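
Steps 1 to 4 amount to a rejective scheme, which the following Python sketch illustrates. The name `sampford_sample`, the `max_tries` safeguard, and drawing the remaining (n-1) units with replacement in one call are assumptions of this sketch; step 5 (the degenerate case) is not covered.

```python
import random


def sampford_sample(sizes, n, rng=None, max_tries=100000):
    """Return sorted 0-based indices of n distinct units (hypothetical helper)."""
    rng = rng or random.Random()
    total = sum(sizes)
    p = [m / total for m in sizes]
    if n * max(p) >= 1:
        raise ValueError("requires max M_i < M/n")
    # Draw probabilities for the remaining units: proportional to p_i / (1 - n p_i).
    later = [pi / (1 - n * pi) for pi in p]
    units = range(len(sizes))
    for _ in range(max_tries):
        first = rng.choices(units, weights=p, k=1)
        rest = rng.choices(units, weights=later, k=n - 1)
        sample = first + rest
        if len(set(sample)) == n:   # accept only duplicate-free samples
            return sorted(sample)
    raise RuntimeError("no duplicate-free sample found")
```

The acceptance rate falls as n grows or as some n p_i approaches 1, so the retry cap is there only to keep the sketch from looping indefinitely.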


Joint Inclusion Probabilities 


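First define the following quantities:

$$\lambda_i = \frac{p_i}{1 - n p_i}, \quad i = 1,\dots,N$$

$$c_r = \sum_{k=1}^{N} \lambda_k^{r}, \quad r = 1,\dots,n$$

$$L_0 = L_{0,ij} = 1, \quad i,j = 1,\dots,N$$

$$L_m = \frac{1}{m} \sum_{k=1}^{m} (-1)^{k-1} c_k L_{m-k}, \quad m = 1,\dots,n$$

$$L_{m,ij} = L_m - (\lambda_i + \lambda_j) L_{m-1,ij} - \lambda_i \lambda_j L_{m-2,ij}, \quad m = 1,\dots,n;\ i,j = 1,\dots,N$$

$$K_n = \left[ \sum_{k=1}^{n} \frac{k L_{n-k}}{n^{k}} \right]^{-1}$$

Given the above quantities, the joint inclusion probability of the ith and jth population units is

$$\pi_{ij} = K_n \lambda_i \lambda_j \sum_{k=2}^{n} \frac{[k - n(p_i + p_j)]\, L_{n-k,ij}}{n^{k-2}}$$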


PPS Brewer’s Method (n=2) 
Brewer’s (1963) method is a special case of Sampford’s method when n=2. 
PPS Murthy’s Method (n=2) 


Murthy’s (1957) method selects two units out of N population units with probabilities proportional 
to size without replacement. 


■ Inclusion probability of ith unit = $p_i \left( 1 + \sum_{k=1}^{N} \frac{p_k}{1 - p_k} - \frac{p_i}{1 - p_i} \right)$


= Sampling weight of ith unit = inverse of inclusion probability 


Algorithm 


1. Select first unit from the population with probabilities $p_k$, k=1,...,N.
2. If the first selected unit has index i, then select second unit with probabilities $p_k / (1 - p_i)$, $k \ne i$.


Joint Inclusion Probabilities 
The joint inclusion probability of population units i and j is given by 


$$\pi_{ij} = \frac{p_i p_j (2 - p_i - p_j)}{(1 - p_i)(1 - p_j)}$$
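
The first-order and joint inclusion probabilities for this two-draw scheme are easy to check numerically. The sketch below uses hypothetical function names and 0-based indices; it evaluates both quantities from a list of draw probabilities p.

```python
def murthy_inclusion(p):
    """First-order inclusion probabilities for two draws without replacement."""
    odds_sum = sum(pk / (1 - pk) for pk in p)
    # p_i * (1 + sum over k != i of p_k / (1 - p_k))
    return [pi * (1 + odds_sum - pi / (1 - pi)) for pi in p]


def murthy_joint(p, i, j):
    """Joint inclusion probability of units i and j (i != j)."""
    return p[i] * p[j] * (2 - p[i] - p[j]) / ((1 - p[i]) * (1 - p[j]))
```

Two consistency checks follow from the fixed sample size of 2: the first-order probabilities sum to 2, and for each unit i the joint probabilities summed over j ≠ i reproduce its first-order probability.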


Saved Variables 
STAGEPOPSIZE saves the population sizes of each stratum in a given stage. 


STAGESAMPSIZE saves the actual sample sizes of each stratum in a given stage. See the 
“Sample Size” section for details on sample size calculations. 


STAGESAMPRATE saves the actual sampling rate of each stratum in a given stage. It 

is computed by dividing the actual sample size by the population size. Due to the use of 
rounding and application of RATEMINSIZE and RATEMAXSIZE on sample size, the resulting 
STAGESAMPRATE may be different from sampling rate specified by the user. 


STAGEINCLPROB saves stage inclusion probabilities. These depend on the selection method. 
The formulae are given in the individual sections of each selection method. 


STAGEWEIGHT saves the inverse of stage inclusion probabilities. 
SAMPLEWEIGHT saves the product of previous weight (if specified) and all the stage weights. 


STAGEHITS saves the number of times a unit is selected in a given stage. When a WOR method 
is used the value is always 0 or 1. When a WR method is used it can be any nonnegative integer. 


SAMPLEHITS saves the number of times an ultimate sampling unit is selected. It is equal to 
STAGEHITS of the last specified stage. 


STAGEINDEX saves an index variable used to differentiate duplicated sampling units resulting
from sampling with replacement. STAGEINDEX ranges from one to the number of hits of a selected
unit.
CSTABULATE Algorithms


This document describes the algorithms used in the complex sampling estimation procedure 
CSTABULATE. 


Complex sample data must contain both the values of the variables to be analyzed and the 
information on the current sampling design. The sampling design includes the sampling method, 
strata and clustering information, inclusion probabilities and the overall sampling weights. 


The sampling design specification for CSTABULATE may include up to three stages of sampling. 
Any of the following general sampling methods may be assumed in the first stage: random 
sampling with replacement, random sampling without replacement and equal probabilities and 
random sampling without replacement and unequal probabilities. The first two sampling methods 
can also be specified for the second and the third sampling stage. 


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


H        Number of strata.
n_h      Sampled number of primary sampling units (PSU) per stratum.
f_h      Sampling rate per stratum.
m_hi     Number of elements in the ith sampled unit in stratum h.
w_hij    Overall sampling weight for the jth element in the ith sampled unit in stratum h.
y_hij    Value of variable y for the jth element in the ith sampled unit in stratum h.
Y        Population total sum for variable y.
n        Total number of elements in the sample.
N        Total number of elements in the population.


Weights 


Overall weights specified for each ultimate element are processed as given. See “Weights” in 
Complex Samples: Covariance Matrix of Total for more information on weights and variance 
estimation methods. 


Z Expressions 


For variables y and y':

$$z_{hij} = w_{hij}\, y_{hij}, \qquad z'_{hij} = w_{hij}\, y'_{hij}$$

$$z_{hi} = \sum_{j=1}^{m_{hi}} z_{hij}, \qquad z'_{hi} = \sum_{j=1}^{m_{hi}} z'_{hij}$$

$$\bar{z}_h = \frac{1}{n_h} \sum_{i=1}^{n_h} z_{hi}, \qquad \bar{z}'_h = \frac{1}{n_h} \sum_{i=1}^{n_h} z'_{hi}$$

$$S_h^2(y, y') = \frac{1}{n_h - 1} \sum_{i=1}^{n_h} (z_{hi} - \bar{z}_h)(z'_{hi} - \bar{z}'_h)$$


For multi-stage samples, the index h denotes a stratum in the given stage, and i stands for unit 
from h in the same stage. The index j runs over all final stage elements contained in unit hi. 


Variable Total 


An estimate for the population total of variable y in a single-stage sample is the weighted sum 
over all the strata and all the clusters: 


$$\hat{Y} = \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij}\, y_{hij}$$
Alternatively, we compute the weighted sum over all the elements in the sample: 


$$\hat{Y} = \sum_{i=1}^{n} w_i\, y_i$$


The latter expression is more general as it also applies to multi-stage samples. 


Variables Total Covariance 


For a multi-stage sample containing a with replacement sampling stage, all specifications other 
than weights are ignored for the subsequent stages. They make no contribution to the variance 
estimates. 


Single Stage Sample 


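The covariance of the total for variables y and y' in a single-stage sample is estimated by the
following:

$$\hat{C}(\hat{Y}, \hat{Y}') = \hat{C}_1(\hat{Y}, \hat{Y}') = \sum_{h=1}^{H} U_h(\hat{Y}, \hat{Y}')$$

where $U_h(\hat{Y}, \hat{Y}')$ is an estimate contribution from stratum h and depends on the
sampling method as follows:

■ For sampling with replacement: $U_h(\hat{Y}, \hat{Y}') = n_h S_h^2(y, y')$

■ For simple random sampling: $U_h(\hat{Y}, \hat{Y}') = (1 - f_h)\, n_h S_h^2(y, y')$

■ For sampling without replacement and unequal probabilities:

$$U_h(\hat{Y}, \hat{Y}') = \sum_{i=1}^{n_h} \sum_{j>i} \left( \frac{\pi_{hi}\pi_{hj}}{\pi_{hij}} - 1 \right) (z_{hi} - z_{hj})(z'_{hi} - z'_{hj})$$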
$\pi_{hi}$ and $\pi_{hj}$ are the inclusion probabilities for units i and j in stratum h, and $\pi_{hij}$ is the joint
inclusion probability for the same units. This estimator is due to Yates and Grundy (1953) and
Sen (1953).


For each stratum h containing a single element, the covariance contribution $U_h(\hat{Y}, \hat{Y}')$ is
always set to zero.
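
As an illustration of the three stratum contributions, the sketch below computes them from unit-level z totals within one stratum. The function names and the list-based inputs (`z`, `zp` for z and z', `pi` for first-order and `pij` for joint inclusion probabilities) are assumptions made for this example.

```python
def s2(z, zp):
    """S_h^2(y, y') from unit totals z_hi and z'_hi in one stratum."""
    n = len(z)
    zbar, zpbar = sum(z) / n, sum(zp) / n
    return sum((a - zbar) * (b - zpbar) for a, b in zip(z, zp)) / (n - 1)


def u_with_replacement(z, zp):
    """Contribution n_h * S_h^2; zero for a single-unit stratum."""
    return 0.0 if len(z) < 2 else len(z) * s2(z, zp)


def u_srs(z, zp, f):
    """Contribution (1 - f_h) * n_h * S_h^2; zero for a single-unit stratum."""
    return 0.0 if len(z) < 2 else (1 - f) * len(z) * s2(z, zp)


def u_yates_grundy(z, zp, pi, pij):
    """Unequal-probability contribution summed over pairs i < j."""
    total = 0.0
    for i in range(len(z)):
        for j in range(i + 1, len(z)):
            weight = pi[i] * pi[j] / pij[i][j] - 1
            total += weight * (z[i] - z[j]) * (zp[i] - zp[j])
    return total
```

A useful sanity check: with equal inclusion probabilities n/N and joint probabilities n(n-1)/(N(N-1)), the pairwise form reduces algebraically to the simple random sampling form.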


Two-stage Sample 


When the sample is obtained in two stages and sampling without replacement is applied in the
first stage, we use the following estimate for the covariance of the total for variables y and y':

$$\hat{C}(\hat{Y}, \hat{Y}') = \hat{C}_2(\hat{Y}, \hat{Y}') = \hat{C}_1(\hat{Y}, \hat{Y}') + \sum_{h=1}^{H} \sum_{i=1}^{n_h} \pi_{hi} \sum_{k=1}^{K_{hi}} U_{hik}(\hat{Y}, \hat{Y}')$$

where

$\pi_{hi}$ is the first stage inclusion probability for the primary sampling unit i in stratum h. In the case
of simple random sampling, the inclusion probability is equal to the sampling rate $f_h$ for stratum h.

$K_{hi}$ is the number of second stage strata in the primary sampling unit i within the first stage
stratum h.

$U_{hik}(\hat{Y}, \hat{Y}')$ is a covariance contribution from the second stage stratum k from the primary
sampling unit hi. It depends on the second stage sampling method. The corresponding formula
given in the “Single Stage Sample” section applies.


Three-stage Sample 


When the sample is obtained in three stages where sampling in the first stage is done without
replacement and simple random sampling is applied in the second stage, we use the following
estimate for the covariance of the total for variables y and y':

$$\hat{C}(\hat{Y}, \hat{Y}') = \hat{C}_2(\hat{Y}, \hat{Y}') + \sum_{h=1}^{H} \sum_{i=1}^{n_h} \pi_{hi} \sum_{k=1}^{K_{hi}} f_{hik} \sum_{j=1}^{n_{hik}} \sum_{l=1}^{L_{hikj}} U_{hikjl}(\hat{Y}, \hat{Y}')$$

where

■ $f_{hik}$ is the sampling rate for the secondary sampling units in the second stage stratum hik.
■ $L_{hikj}$ is the number of third stage strata in the secondary sampling unit hikj.
■ $U_{hikjl}(\hat{Y}, \hat{Y}')$ is a variance contribution from the third stage stratum l contained in
the secondary sampling unit hikj. It depends on the third stage sampling method. The
corresponding formula given in the “Single Stage Sample” section applies.


Variable Total Variance 


The variance of the total for variable y in a complex sample is estimated by

$$\hat{V}(\hat{Y}) = \hat{C}(\hat{Y}, \hat{Y})$$

with $\hat{C}(\hat{Y}, \hat{Y})$ defined above.


Population Size Estimation


An estimate for the population size corresponds to the estimate for the variable total; it is the sum of
the sampling weights. We have the following estimate for single-stage samples:

$$\hat{N} = \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij}$$

More generally,

$$\hat{N} = \sum_{i=1}^{n} w_i$$

The variance of $\hat{N}$ is obtained by replacing $y_{hij}$ with 1; that is, by replacing $z_{hij}$ with $w_{hij}$ in the
corresponding variance estimator formula for $\hat{V}(\hat{Y})$.


Cell Estimates: One-Way Tables 


Sizes 


Let the population be classified according to the values of a single categorical row variable
and possibly one or more categorical variables in the layer. Categories for the row variable
are enumerated by r=1,...,R and categories for the layer variables are given by l=1,...,L. Each
combination of the values (r,l) defines a domain and a cell in the one-way table (r,l), r=1,...,R. For
each cell we define a corresponding indicator variable:

$$\delta_{hij}(r,l) = \begin{cases} 1 & \text{if the sample unit } hij \text{ is in the cell } (r,l) \\ 0 & \text{otherwise} \end{cases}$$

To estimate a cell population size or a table population size, we replace $y_i$ with $\delta_i(r,l)$ in
the formula for the population total and obtain the following expressions:

■ Cell population size: $\hat{N}(r,l) = \sum_{i=1}^{n} w_i \delta_i(r,l)$
■ Table population size: $\hat{N}(+,l) = \sum_{i=1}^{n} \sum_{r=1}^{R} w_i \delta_i(r,l)$
Similarly, in order to estimate variances of the above estimators, we substitute $y_{hij}$ with
$\delta_{hij}(r,l)$ in the corresponding formula for the whole population. The following substitutions of
$z_{hij}$ in the formulas for $\hat{V}(\hat{Y})$ are used for estimating the variances of these estimators:

■ Cell population size: $z_{hij}(r,l) = w_{hij}\,\delta_{hij}(r,l)$
■ Table population size: $z_{hij}(+,l) = \sum_{r=1}^{R} w_{hij}\,\delta_{hij}(r,l)$

Proportions

A table proportion estimate is computed at each layer category as follows:

$$\hat{P}_{tab}(r,l) = \hat{N}(r,l) / \hat{N}(+,l)$$

This estimator is a ratio and we apply Taylor linearization formulas as suggested by Woodruff
(1971). The following substitution of $z_{hij}$ in the formulas for $\hat{V}(\hat{Y})$ is used for estimating the
variance of the table proportion at a given layer:

$$z_{hij}(r,l) = w_{hij}\, \frac{\delta_{hij}(r,l) - \delta_{hij}(+,l)\, \hat{P}_{tab}(r,l)}{\hat{N}(+,l)}$$


Cell Estimates: Two-Way Tables 


Let the population be cross-classified according to the values of a categorical row variable, a 
categorical column variable and possibly one or more categorical variables in the layer. Categories 
for the row variable are enumerated by r=1,...,R while categories for the column variable are 
denoted by c=1,...,C and categories for the layer variables are given by /=1,...,L. Each combination 
of values (r,c,l) defines a domain and a cell in the two-way table (r,c,l). For each cell we define
a corresponding indicator variable:

$$\delta_{hij}(r,c,l) = \begin{cases} 1 & \text{if the sample unit } hij \text{ is in the cell } (r,c,l) \\ 0 & \text{otherwise} \end{cases}$$

We will also use the following indicator notation:

■ Row indicator: $\delta_i(r,+,l) = \sum_{c=1}^{C} \delta_i(r,c,l)$
■ Column indicator: $\delta_i(+,c,l) = \sum_{r=1}^{R} \delta_i(r,c,l)$
■ Table indicator: $\delta_i(+,+,l) = \sum_{r=1}^{R} \sum_{c=1}^{C} \delta_i(r,c,l)$
Sizes

To estimate various domain sizes, we substitute $y_i$ with $\delta_i$ in the corresponding formula for the
whole population as follows:

■ Cell population size: $\hat{N}(r,c,l) = \sum_{i=1}^{n} w_i \delta_i(r,c,l)$
■ Row population size: $\hat{N}(r,+,l) = \sum_{i=1}^{n} w_i \delta_i(r,+,l)$
■ Column population size: $\hat{N}(+,c,l) = \sum_{i=1}^{n} w_i \delta_i(+,c,l)$
■ Table population size: $\hat{N}(+,+,l) = \sum_{i=1}^{n} w_i \delta_i(+,+,l)$

Similarly, in order to estimate the variance of the above estimators, we substitute $y_{hij}$ with $\delta_{hij}$ in
the corresponding formula for the whole population. The following substitutions of $z_{hij}$ in the
formulas for $\hat{V}(\hat{Y})$ are used for estimating variances of:

■ Cell population size: $z_{hij}(r,c,l) = w_{hij}\,\delta_{hij}(r,c,l)$
■ Row population size: $z_{hij}(r,+,l) = w_{hij}\,\delta_{hij}(r,+,l)$
■ Column population size: $z_{hij}(+,c,l) = w_{hij}\,\delta_{hij}(+,c,l)$
■ Table population size: $z_{hij}(+,+,l) = w_{hij}\,\delta_{hij}(+,+,l)$

Proportions
Proportions 


We define various proportion estimates to be computed as follows:

■ Row population proportion: $\hat{P}_{row}(r,c,l) = \hat{N}(r,c,l) / \hat{N}(r,+,l)$
■ Column population proportion: $\hat{P}_{col}(r,c,l) = \hat{N}(r,c,l) / \hat{N}(+,c,l)$
■ Table population proportion: $\hat{P}_{tab}(r,c,l) = \hat{N}(r,c,l) / \hat{N}(+,+,l)$
■ Marginal column population proportion: $\hat{P}_{mcol}(+,c,l) = \hat{N}(+,c,l) / \hat{N}(+,+,l)$
■ Marginal row population proportion: $\hat{P}_{mrow}(r,+,l) = \hat{N}(r,+,l) / \hat{N}(+,+,l)$

In order to estimate variances of the above estimators, we again apply the Taylor linearization
formulas as for the one-way tables. The following substitutions of $z_{hij}$ in the formulas for
$\hat{V}(\hat{Y})$ are used for estimating variances of:

■ Row population proportion: $z_{hij}(r,c,l) = w_{hij}\, \dfrac{\delta_{hij}(r,c,l) - \delta_{hij}(r,+,l)\, \hat{P}_{row}(r,c,l)}{\hat{N}(r,+,l)}$

■ Column population proportion: $z_{hij}(r,c,l) = w_{hij}\, \dfrac{\delta_{hij}(r,c,l) - \delta_{hij}(+,c,l)\, \hat{P}_{col}(r,c,l)}{\hat{N}(+,c,l)}$

■ Table population proportion: $z_{hij}(r,c,l) = w_{hij}\, \dfrac{\delta_{hij}(r,c,l) - \delta_{hij}(+,+,l)\, \hat{P}_{tab}(r,c,l)}{\hat{N}(+,+,l)}$

■ Marginal column population proportion: $z_{hij}(+,c,l) = w_{hij}\, \dfrac{\delta_{hij}(+,c,l) - \delta_{hij}(+,+,l)\, \hat{P}_{mcol}(+,c,l)}{\hat{N}(+,+,l)}$

■ Marginal row population proportion: $z_{hij}(r,+,l) = w_{hij}\, \dfrac{\delta_{hij}(r,+,l) - \delta_{hij}(+,+,l)\, \hat{P}_{mrow}(r,+,l)}{\hat{N}(+,+,l)}$
Standard Errors


Let Z denote any of the population or subpopulation quantities defined above: variable total,
population size, ratio or mean. Then the standard error of an estimator $\hat{Z}$ is the square root of its
estimated variance:

$$\mathrm{StdError}(\hat{Z}) = \sqrt{\hat{V}(\hat{Z})}$$


Coefficient of Variation


The coefficient of variation of the estimator $\hat{Z}$ is the ratio of its standard error and its value:

$$CV(\hat{Z}) = \frac{\mathrm{StdError}(\hat{Z})}{\hat{Z}}$$

The coefficient of variation is undefined when $\hat{Z} = 0$.


Confidence Limits 


A level $1 - \alpha$ confidence interval is constructed for a given $0 \le \alpha \le 1$ for any domain size
$\hat{N}_d$ defined earlier. The confidence bounds are defined as

$$\hat{N}_d \pm SE(\hat{N}_d)\, t_\nu(1 - \alpha/2)$$

where $SE(\hat{N}_d)$ is the estimated standard error of $\hat{N}_d$, and $t_\nu(1 - \alpha/2)$ is the
$100(1 - \alpha/2)$ percentile of the t distribution with $\nu$ degrees of freedom.


Proportions


For any domain proportion $\hat{P}_d$, we use the logistic transformation $f(p) = \ln(p/(1-p))$ and
obtain the following $1 - \alpha$ level confidence bounds for the transformed estimate:

$$\ln\left( \frac{\hat{P}_d}{1 - \hat{P}_d} \right) \pm \frac{SE(\hat{P}_d)}{\hat{P}_d (1 - \hat{P}_d)}\, t_\nu(1 - \alpha/2)$$

These bounds are transformed back to the original metric using the logistic inverse
$f^{-1}(y) = \exp(y)/(1 + \exp(y))$.
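
A short Python sketch of the logit-scale interval may help make the back-transformation concrete. The function name `proportion_ci` is hypothetical, and the t percentile is passed in as `t_crit` because the Python standard library has no t quantile function; the caller is assumed to obtain it elsewhere.

```python
import math


def proportion_ci(p, se, t_crit):
    """Confidence bounds for a proportion via the logistic transformation."""
    center = math.log(p / (1 - p))        # f(p) = ln(p / (1 - p))
    half = se / (p * (1 - p)) * t_crit    # half-width on the logit scale

    def inverse(y):
        return math.exp(y) / (1 + math.exp(y))

    return inverse(center - half), inverse(center + half)
```

Because the bounds are built on the logit scale, they always stay inside (0, 1) and are generally not symmetric around the estimate.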


Degrees of Freedom 


The degrees of freedom for the t distributions above is calculated as the difference between the 
number of primary sampling units and the number of strata in the first stage of sampling. This 
quantity is also referred to as the sample design degrees of freedom. 
Design Effects

Size

The design effect Deff for a two-way table cell population size is estimated by

$$\widehat{Deff} = \frac{\hat{V}(\hat{N}(r,c,l))}{\hat{V}_{srs}(\hat{N}(r,c,l))}$$

$\hat{V}(\hat{N}(r,c,l))$ is an estimate of the variance of $\hat{N}(r,c,l)$ under the complex sample design, while
$\hat{V}_{srs}(\hat{N}(r,c,l))$ is its estimate of variance under the simple random sampling assumption:

$$\hat{V}_{srs}(\hat{N}(r,c,l)) = (fpc)\, \frac{1}{n-1}\, \hat{N}(r,c,l)\, (\hat{N} - \hat{N}(r,c,l))$$

Assuming sampling without replacement we have $fpc = (1 - n/\hat{N})$ given that $n/\hat{N} < 1$, while for
sampling with replacement we set fpc = 1. This assumption is independent of the sampling
specified for the complex sample design based variance $\hat{V}(\hat{N}(r,c,l))$.


Computations of the design effects for the one-way table cells, as well as for the row, column and
table population sizes are analogous to the one above.
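
The ratio above is straightforward to compute once the complex-design variance is available. The following sketch uses hypothetical function and argument names (`n_cell` for the estimated cell size, `n_hat` for the estimated population size, `n` for the sample size) and assumes the simple random sampling variance takes the form fpc · N̂(r,c,l)(N̂ − N̂(r,c,l))/(n − 1).

```python
def srs_variance_of_size(n_cell, n_hat, n, with_replacement=False):
    """Variance of a cell size estimate under simple random sampling."""
    fpc = 1.0 if with_replacement else 1.0 - n / n_hat
    return fpc * n_cell * (n_hat - n_cell) / (n - 1)


def design_effect(v_complex, v_srs):
    """Ratio of the complex-design variance to the SRS variance."""
    return v_complex / v_srs
```

For example, `design_effect(3200.0, srs_variance_of_size(200.0, 1000.0, 101, with_replacement=True))` compares a complex-design variance of 3200 with the corresponding with-replacement SRS variance.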


Proportions


Deff for a two-way table population proportion is estimated by

$$\widehat{Deff} = \frac{\hat{V}(\hat{P}_{tab}(r,c,l))}{\hat{V}_{srs}(\hat{P}_{tab}(r,c,l))}$$

$\hat{V}(\hat{P}_{tab}(r,c,l))$ is an estimate of the variance of $\hat{P}_{tab}(r,c,l)$ under the complex sample design,
while $\hat{V}_{srs}(\hat{P}_{tab}(r,c,l))$ is its estimate of variance under the simple random sampling assumption:

$$\hat{V}_{srs}(\hat{P}_{tab}(r,c,l)) = (fpc)\, \frac{\hat{N}}{n-1}\, \frac{\hat{P}_{tab}(r,c,l)\, (1 - \hat{P}_{tab}(r,c,l))}{\hat{N}(+,+,l)}$$

with fpc as specified earlier.


Computations of the design effects for one-way table proportions, as well as for the row, column,
marginal row and marginal column population proportions are analogous to the one above.


Design effects for various estimates are computed only when the condition $n/\hat{N} < 1$ is satisfied.

Design effect square root

We also compute the square root of a design effect, $\sqrt{\widehat{Deff}}$.


Design effects and their applications have been discussed by Kish (1965) and Kish (1995).
Tests of Independence for Two-Way Tables


Let the population be cross-classified according to the values of a categorical row variable, a
categorical column variable and possibly one or more categorical variables in the layer. Categories
for the row variable are enumerated by r=1,...,R, while categories for the column variable are
denoted by c=1,...,C. When the layer variables are given we assume that their categories coincide
with the strata in the first sampling stage. In the following we omit reference to the layers as the
formulas apply for each stratum separately when needed.


We use a contrast matrix C defined as follows. Let $A_R$ be the contrast matrix given by

$$A_R = [\, I_{R-1} \mid -1_{R-1} \,]$$

$I_{R-1}$ is an identity matrix of size R-1 and $1_{R-1}$ is a vector with R-1 elements equal to 1. Define
C to be an $RC \times (R-1)(C-1)$ matrix given by the following Kronecker product:

$$C = A_R' \otimes A_C'$$


Adjusted Pearson Statistic 


Under the null hypothesis, the asymptotic distribution of $X^2$ is generally not a chi-square
distribution, so we perform an adjustment using the following $\Delta$ matrix:

$$\Delta = n \left( C' D_{\hat{P}}^{-1} M D_{\hat{P}}^{-1} C \right)^{-1} \left( C' D_{\hat{P}}^{-1} \hat{V}(\hat{P}) D_{\hat{P}}^{-1} C \right)$$

$\hat{P}$ is a vector and $D_{\hat{P}}$ is a diagonal matrix of size RC containing elements $\hat{P}(r,c)$.

$M = D_{\hat{P}} - \hat{P}\hat{P}'$ is a multinomial covariance matrix estimating the asymptotic covariance
of $\hat{P}$ under the simple random sampling design, while $\hat{V}(\hat{P})$ estimates the covariance matrix of
$\hat{P}$ under the complex sampling design.


We use the F-based variant of Rao and Scott's (1984) second-order adjustment

$$F(X^2) = \frac{X^2}{\mathrm{tr}\,\Delta}$$

where

$$d = \frac{(\mathrm{tr}\,\Delta)^2}{\mathrm{tr}\,\Delta^2}$$

This statistic has an approximate $F(d, d\nu)$ distribution. Properties of this test are given in a review
of simulation studies by Rao and Thomas (2003).
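
Given the Pearson statistic and the Δ matrix, the adjustment needs only two traces. The sketch below is a plain-Python illustration with a hypothetical function name; `delta` is assumed to be a square list of lists and `nu` the sample design degrees of freedom.

```python
def rao_scott_f(x2, delta, nu):
    """Return (F statistic, numerator df d, denominator df d*nu)."""
    k = len(delta)
    tr = sum(delta[i][i] for i in range(k))
    # tr(Delta^2) = sum over i, j of Delta[i][j] * Delta[j][i]
    tr_sq = sum(delta[i][j] * delta[j][i] for i in range(k) for j in range(k))
    d = tr * tr / tr_sq
    return x2 / tr, d, d * nu
```

When Δ is a multiple of the identity, d equals the dimension of Δ, so the adjustment reduces to dividing the statistic by a common design effect times that dimension.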
Adjusted Likelihood Ratio Statistic


The likelihood ratio test statistic is given by

$$G^2 = 2n \sum_{r=1}^{R} \sum_{c=1}^{C} \hat{P}(r,c) \ln\left( \frac{\hat{P}(r,c)}{\hat{P}(r,+)\,\hat{P}(+,c)} \right)$$

The adjusted likelihood ratio statistic is computed in an analogous manner to the Pearson
adjustment, where $\Delta$ is the same as before and

$$F(G^2) = \frac{G^2}{\mathrm{tr}\,\Delta}$$

where

$$d = \frac{(\mathrm{tr}\,\Delta)^2}{\mathrm{tr}\,\Delta^2}$$

This statistic has an approximate $F(d, d\nu)$ distribution.


Residuals 


Under the independence hypothesis, the expected table proportion estimates are given by
$E(r,c) = \hat{P}(r,+)\,\hat{P}(+,c)$ and residuals are defined as $R(r,c) = \hat{P}(r,c) - E(r,c)$.
Standardized residuals are computed by

$$\frac{R(r,c)}{\sqrt{\hat{V}(R(r,c))}}$$

where $\hat{V}(R(r,c))$ denotes the estimated residual variance.


Let $M = D_{\hat{P}} - \hat{P}\hat{P}'$ estimate the asymptotic covariance matrix under simple random sampling,
where $\hat{P}$ and $D_{\hat{P}}$ are defined as above. X is another contrast matrix specified by

$$X = [\, A_R' \otimes 1_C \mid 1_R \otimes A_C' \,]$$

Contrast matrices $A_R$ and $A_C$, as well as the unit vectors $1_R$ and $1_C$, are defined as earlier.
Variance estimates for residuals are obtained from the diagonal of the following matrix:

$$\hat{V}(R) = \left[ I - M X (X' M X)^{-1} X' \right] \hat{V}(\hat{P}) \left[ I - X (X' M X)^{-1} X' M \right]$$


Odds Ratios and Risks 


These statistics are computed only for 2×2 tables. If any layers are specified, they must correspond
to the first stage strata.


Let $\hat{N}_{11}$, $\hat{N}_{12}$, $\hat{N}_{21}$ and $\hat{N}_{22}$ be the cell population size estimates, $\hat{N}_{1+}$, $\hat{N}_{2+}$, $\hat{N}_{+1}$ and $\hat{N}_{+2}$ be
marginal estimates and $\hat{N}_{++}$ the population size estimate.
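Estimates and Variances


The odds ratio is defined by the following expression:

$$OR = \frac{\hat{N}_{11}\hat{N}_{22}}{\hat{N}_{12}\hat{N}_{21}}$$

Relative risks are defined by

$$RR_1 = \frac{\hat{N}_{11}/\hat{N}_{1+}}{\hat{N}_{21}/\hat{N}_{2+}} \quad \text{and} \quad RR_2 = \frac{\hat{N}_{12}/\hat{N}_{1+}}{\hat{N}_{22}/\hat{N}_{2+}}$$

Risk differences are given by

$$D_1 = \frac{\hat{N}_{11}}{\hat{N}_{1+}} - \frac{\hat{N}_{21}}{\hat{N}_{2+}} \quad \text{and} \quad D_2 = \frac{\hat{N}_{12}}{\hat{N}_{1+}} - \frac{\hat{N}_{22}}{\hat{N}_{2+}}$$

The following substitutions of $z_{hij}$ in the formulas for $\hat{V}(\hat{Y})$ are used for estimating variances:

■ Odds ratio: $z_{hij}(r,c) = w_{hij} \left( \dfrac{\delta_{hij}(1,1)}{\hat{N}_{11}} - \dfrac{\delta_{hij}(1,2)}{\hat{N}_{12}} - \dfrac{\delta_{hij}(2,1)}{\hat{N}_{21}} + \dfrac{\delta_{hij}(2,2)}{\hat{N}_{22}} \right) \times OR$

■ Risk ratio $RR_1$: $z_{hij}(r,c) = w_{hij} \left( \dfrac{\delta_{hij}(1,1)\hat{N}_{12}}{\hat{N}_{11}\hat{N}_{1+}} - \dfrac{\delta_{hij}(1,2)}{\hat{N}_{1+}} - \dfrac{\delta_{hij}(2,1)\hat{N}_{22}}{\hat{N}_{21}\hat{N}_{2+}} + \dfrac{\delta_{hij}(2,2)}{\hat{N}_{2+}} \right) \times RR_1$

■ Risk difference $D_1$: $z_{hij}(r,c) = w_{hij} \left( \dfrac{\delta_{hij}(1,1)\hat{N}_{12} - \delta_{hij}(1,2)\hat{N}_{11}}{\hat{N}_{1+}^2} - \dfrac{\delta_{hij}(2,1)\hat{N}_{22} - \delta_{hij}(2,2)\hat{N}_{21}}{\hat{N}_{2+}^2} \right)$

The estimations of variance for $RR_2$ and $D_2$ are performed using similar substitutions.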


Confidence Limits 


A level $1 - \alpha$ confidence interval is constructed for a given $0 \le \alpha \le 1$ for odds ratio, risk ratio
and risk difference in every table.


For the odds ratio or risk ratio $\hat{R}$ we use the logarithm transformation and obtain the confidence
bounds

$$\ln(\hat{R}) \pm \frac{SE(\hat{R})}{\hat{R}}\, t_\nu(1 - \alpha/2)$$

These bounds are transformed back to the original metric using the exponential function. No
transformations are used when estimating confidence bounds for a risk difference $\hat{D}$:

$$\hat{D} \pm SE(\hat{D})\, t_\nu(1 - \alpha/2)$$
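
The log-scale interval can be illustrated with a few lines of Python. The function names are hypothetical, the four arguments of `odds_ratio` are the estimated cell sizes, and the t percentile `t_crit` is assumed to be supplied by the caller.

```python
import math


def odds_ratio(n11, n12, n21, n22):
    """Odds ratio from the four estimated cell population sizes."""
    return (n11 * n22) / (n12 * n21)


def log_scale_ci(r, se, t_crit):
    """Confidence bounds for an odds ratio or risk ratio via the log transformation."""
    half = se / r * t_crit                # half-width on the log scale
    return math.exp(math.log(r) - half), math.exp(math.log(r) + half)
```

The resulting bounds are symmetric around the estimate on the multiplicative scale: their product equals the squared estimate.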


Tests of Homogeneity for One-Way Tables 


Let the population be classified according to the values of a categorical row variable and possibly
one or more categorical variables in the layer. Categories for the row variable are enumerated by
r=1,...,R. When the layer variables are given we assume that their categories coincide with the
strata in the first sampling stage. In the following we omit references to the layers as the formulas
apply for each stratum separately when needed.

We study proportions $\hat{P}(r) = \hat{N}(r)/\hat{N}(+)$. The test of homogeneity consists of testing the
null hypothesis $H_0$: $P(r) = 1/R$ for r=1,...,R.


Adjusted Pearson Statistic 
We perform an adjusted Pearson statistic test for testing the homogeneity. The Pearson test
statistic is computed according to the following standard formula:

$$X^2 = nR \sum_{r=1}^{R} \left( \hat{P}(r) - 1/R \right)^2$$

Under the null hypothesis, the asymptotic distribution of $X^2$ is generally not a chi-square
distribution, so we perform an adjustment using the following $\Delta$ matrix:

$$\Delta = n \left( M(\hat{P}_0) \right)^{-1} \hat{V}(\hat{P}_0)$$

$\hat{V}(\hat{P}_0)$ is the estimated covariance matrix under the complex sample design, while $M(\hat{P}_0)$ is an
estimated asymptotic covariance matrix under simple random sampling, given by

$$M(\hat{P}_0) = \mathrm{diag}(\hat{P}_0) - \hat{P}_0 \hat{P}_0'$$

where $\hat{P}_0$ is a vector and $\mathrm{diag}(\hat{P}_0)$ is a diagonal matrix of size R-1 containing elements $\hat{P}(r)$,
r=1,...,R-1.


We use the F-based variant of Rao and Scott's (1984) second-order adjustment

$$F(X^2) = \frac{X^2}{\mathrm{tr}\,\Delta}, \qquad d = \frac{(\mathrm{tr}\,\Delta)^2}{\mathrm{tr}\,\Delta^2}$$

This statistic has an asymptotic approximate $F(d, d\nu)$ distribution.


Adjusted Likelihood Ratio Statistic 
The likelihood ratio test statistic is given by

$$G^2 = 2n \sum_{r=1}^{R} \hat{P}(r) \ln\left( R\,\hat{P}(r) \right)$$

The adjusted likelihood ratio statistic is computed in an identical way as the adjustment for the
Pearson statistic:

$$F(G^2) = \frac{G^2}{\mathrm{tr}\,\Delta}$$

d and $\Delta$ are the same as specified before. This statistic has an asymptotic approximate
$F(d, d\nu)$ distribution.
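
The two unadjusted statistics for the homogeneity test can be sketched as follows. The function names are hypothetical, `p` is the list of estimated proportions, `n` the sample size, and a zero proportion is assumed to contribute nothing to the likelihood ratio sum (the usual 0·ln 0 = 0 convention).

```python
import math


def pearson_homogeneity(p, n):
    """X^2 = n R sum (P(r) - 1/R)^2 for the hypothesis P(r) = 1/R."""
    R = len(p)
    return n * R * sum((pr - 1.0 / R) ** 2 for pr in p)


def lr_homogeneity(p, n):
    """G^2 = 2 n sum P(r) ln(R P(r)), skipping zero proportions."""
    R = len(p)
    return 2.0 * n * sum(pr * math.log(R * pr) for pr in p if pr > 0)
```

Both statistics are zero when all proportions equal 1/R and are close to each other for small departures from homogeneity.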
Complex Samples: Covariance Matrix of Total


This document describes the algorithms used in the complex sampling module procedures 
for estimation of covariance matrix of population total estimates. It contains a more general 
formulation of the algorithms given in CSDESCRIPTIVES and CSTABULATE. 


Complex sample data must contain both the values of the variables to be analyzed and the 
information on the current sampling design. Sampling design includes the sampling method, strata 
and clustering information, inclusion probabilities and the overall sampling weights. 


Sampling design specification may include up to three stages of sampling. Any of the following 
general sampling methods may be assumed in the first stage: random sampling with replacement, 
random sampling without replacement and equal probabilities and random sampling without 
replacement and unequal probabilities. The first two sampling methods can also be specified 

for the second and the third sampling stage. 


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


H        Number of strata.
n_h      Sampled number of primary sampling units (PSU) per stratum.
f_h      Sampling rate per stratum.
m_hi     Number of elements in the ith sampled unit in stratum h.
w_hij    Overall sampling weight for the jth element in the ith sampled unit in stratum h.
y_hij    Values of vector y for the jth element in the ith sampled unit in stratum h.
Y        Population total sum for vector of variables y.
n        Total number of elements in the sample.
N        Total number of elements in the population.

Weights 


Overall weights specified for each ultimate element are processed as given. They can be obtained 
as a product of weights for corresponding units computed in each sampling stage. 

When sampling without replacement in a given stage, the substitution $w_{hi} = 1/\pi_{hi}$ for unit
i in stratum h will result in application of the estimator for the population totals due to Horvitz
and Thompson (1952). The corresponding variance estimator will also be unbiased. $\pi_{hi}$ is the
probability of unit i from stratum h being selected in the given stage.

If sampling with replacement in a given stage, the substitution $w_{hi} = 1/(n_h p_{hi})$ yields the
estimator for the population totals due to Hansen and Hurwitz (1943). Repeatedly selected units
should be replicated in the data. The corresponding variance estimator will be unbiased. $p_{hi}$ is the
probability of selecting unit i in a single draw from stratum h in the given stage.


Complex Samples: Covariance Matrix of Total 


Weights obtained in each sampling stage need to be multiplied when processing multi-stage 
samples. The resulting overall weights for the elements in the final stage are used in all 
expressions and formulas below. 


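Z Expressions


$$z_{hij} = w_{hij}\, y_{hij}$$

$$z_{hi} = \sum_{j=1}^{m_{hi}} z_{hij}$$

$$\bar{z}_h = \frac{1}{n_h} \sum_{i=1}^{n_h} z_{hi}$$

$$S_h^2(y) = \frac{1}{n_h - 1} \sum_{i=1}^{n_h} (z_{hi} - \bar{z}_h)(z_{hi} - \bar{z}_h)'$$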


For multi-stage samples, index h denotes a stratum in the given stage, and i stands for unit from h 
in the same stage. Index j runs over all final stage elements contained in unit hi. 


Total Estimation 


An estimate for the population total of vector of variables y in a single-stage sample is the
weighted sum over all the strata and all the clusters:

$$\hat{Y}_T = \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij}\, y_{hij}$$

Alternatively, we compute the weighted sum over all the elements in the sample:

$$\hat{Y}_T = \sum_{i=1}^{n} w_i\, y_i$$

The latter expression is more general as it also applies to multi-stage samples.
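The element-level form of the total can be illustrated with a minimal Python sketch. The function name `estimate_total` and the sample data are hypothetical and are not part of the procedure; the sketch handles a single scalar variable rather than a vector.

```python
def estimate_total(weights, values):
    """Weighted sum over all sample elements: sum_i w_i * y_i."""
    if len(weights) != len(values):
        raise ValueError("weights and values must have the same length")
    return sum(w * y for w, y in zip(weights, values))


# Hypothetical sample of four elements with overall sampling weights.
weights = [10.0, 10.0, 25.0, 25.0]
values = [3.0, 5.0, 2.0, 4.0]

total = estimate_total(weights, values)
print(total)
```

Because the sum runs over elements and uses only the overall weights, the same call applies regardless of how many sampling stages produced those weights.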


Total covariances 


For a multi-stage sample containing a with-replacement sampling stage, all specifications other
than weights are ignored for the subsequent stages. They make no contribution to the variance
estimates.


Single stage sample 


Covariance of the total for vector y in a single-stage sample is estimated by the following:

$$\hat{V}(\hat{Y}_T) = V_1(\hat{Y}_T) = \sum_{h=1}^{H} U_h(\hat{Y}_T)$$

where $U_h(\hat{Y}_T)$ is an estimate contribution from stratum h and depends on the sampling method
as follows:

For sampling with replacement

$$U_h(\hat{Y}_T) = n_h S_h^2(y)$$

For simple random sampling

$$U_h(\hat{Y}_T) = (1 - f_h)\, n_h S_h^2(y)$$

For sampling without replacement and unequal probabilities

$$U_h(\hat{Y}_T) = \sum_{i=1}^{n_h} \sum_{j>i}^{n_h} \left( \frac{\pi_{hi}\pi_{hj}}{\pi_{hij}} - 1 \right) (z_{hi} - z_{hj})(z_{hi} - z_{hj})'$$

$\pi_{hi}$ and $\pi_{hj}$ are the inclusion probabilities for units i and j in stratum h, and $\pi_{hij}$ is the joint inclusion
probability for the same units. This estimator is due to Yates and Grundy (1953) and Sen (1953).
In some situations it may yield a negative estimate, in which case it is treated as undefined. For each stratum h
containing a single element, the covariance contribution $U_h(\hat{Y}_T)$ is always set to zero.
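The stratum contributions can be sketched in Python for a single scalar variable, so that the outer product reduces to a squared difference. This is an illustration under stated assumptions, not the procedure's implementation: the function names are hypothetical, the input `z` is the list of unit totals z_hi within one stratum, and S_h^2 is taken to be the sample variance of those totals with divisor n_h - 1.

```python
import math


def stratum_contribution(z, method="wr", f_h=0.0):
    """Contribution U_h of one stratum, given its unit totals z_hi.

    method: "wr" (with replacement) or "srs" (simple random sampling).
    f_h is the stratum sampling rate and is used only for "srs".
    """
    n_h = len(z)
    if n_h < 2:
        return 0.0  # a stratum with a single unit contributes zero
    z_bar = sum(z) / n_h
    s2 = sum((zi - z_bar) ** 2 for zi in z) / (n_h - 1)
    if method == "wr":
        return n_h * s2
    if method == "srs":
        return (1.0 - f_h) * n_h * s2
    raise ValueError("unknown sampling method")


def ygs_contribution(z, pi, pi_joint):
    """Yates-Grundy-Sen contribution for unequal-probability sampling.

    pi[i] is the inclusion probability of unit i, and pi_joint[i][j]
    the joint inclusion probability of units i and j.
    """
    n_h = len(z)
    if n_h < 2:
        return 0.0
    total = 0.0
    for i in range(n_h):
        for j in range(i + 1, n_h):
            factor = pi[i] * pi[j] / pi_joint[i][j] - 1.0
            total += factor * (z[i] - z[j]) ** 2
    # A negative estimate is treated as undefined.
    return total if total >= 0 else math.nan


def single_stage_variance(contributions):
    """V_1: sum of the stratum contributions U_h."""
    return sum(contributions)


u1 = stratum_contribution([10.0, 14.0, 12.0], method="wr")
u2 = stratum_contribution([10.0, 14.0, 12.0], method="srs", f_h=0.25)
print(single_stage_variance([u1, u2]))
```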


Two-stage sample 


When the sample is obtained in two stages and sampling without replacement is applied in the
first stage, we use the following estimate for the covariance of the total for vector y:

$$\hat{V}(\hat{Y}_T) = V_2(\hat{Y}_T) = V_1(\hat{Y}_T) + \sum_{h=1}^{H} \sum_{i=1}^{n_h} \pi_{hi} \sum_{k=1}^{K_{hi}} U_{hik}(\hat{Y}_T)$$

$\pi_{hi}$ is the first stage inclusion probability for the primary sampling unit i in stratum h. In case of
simple random sampling, the inclusion probability is equal to the sampling rate $f_h$ for stratum h.

$K_{hi}$ is the number of second stage strata in the primary sampling unit i within the first stage
stratum h.

$U_{hik}(\hat{Y}_T)$ is a covariance contribution from the second stage stratum k from the primary sampling
unit hi. Its value depends on the second stage sampling method; the corresponding formula
from "Single stage sample" applies.


Three-stage sample 


When the sample is obtained in three stages where sampling in the first stage is done without
replacement and simple random sampling is applied in the second stage, we use the following
estimate for the covariance of the total for vector y:

$$\hat{V}(\hat{Y}_T) = V_2(\hat{Y}_T) + \sum_{h=1}^{H} \sum_{i=1}^{n_h} \pi_{hi} \sum_{k=1}^{K_{hi}} f_{hik} \sum_{j=1}^{m_{hik}} \sum_{l=1}^{L_{hikj}} U_{hikjl}(\hat{Y}_T)$$

$f_{hik}$ is the sampling rate for the secondary sampling units in the second stage stratum hik.

$L_{hikj}$ is the number of the third stage strata in the secondary sampling unit hikj.

$U_{hikjl}(\hat{Y}_T)$ is a covariance contribution from the third stage stratum l contained in the secondary
sampling unit hikj. Its value depends on the third stage sampling method; the corresponding
formula from "Single stage sample" applies.


Total variance 


Variance of the total estimate for the rth element of the vector $\hat{Y}_T$ is estimated by the rth diagonal
element of the covariance matrix for $\hat{Y}_T$.


Population Size Estimation 


An estimate for the population size corresponds to the estimate for the variable total; it is the sum of
the sampling weights. We have the following estimate for the single-stage samples:

$$\hat{N} = \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij}$$

More generally,

$$\hat{N} = \sum_{i=1}^{n} w_i$$

Variance of $\hat{N}$ is obtained by replacing $y_{hij}$ with 1, i.e. by replacing $z_{hij}$ with $w_{hij}$ in the
corresponding variance estimator formula for $\hat{V}(\hat{Y}_T)$.


Complex Samples: Model Testing


This document describes the methods used for conducting linear hypothesis tests based on the 
estimated parameters in Complex Samples models. 


Required input is a set of linear hypotheses, parameter estimates and their covariance matrix
estimated for the complex sample design. Some methods also require an estimate of the parameter
covariance matrix under the simple random sampling assumption. Also needed is the
number of degrees of freedom for the complex sample design; typically this will be the difference
between the number of primary sampling units and the number of strata in the first stage of
sampling.


Given consistent estimates of the above constructs, no additional restrictions are imposed on 
the complex sample design. 


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


$p$                   Number of regression parameters in the model.
$r$                   The number of linear hypotheses considered.
$L$                   r×p generalized linear hypothesis matrix.
$K$                   r×1 vector of hypothesis values.
$B$                   p×1 vector of population parameters.
$\hat{B}$             p×1 vector of estimated population parameters (solution).
$\hat{V}(\hat{B})$    p×p estimated covariance matrix for $\hat{B}$ given the complex sample design.
$\nu$                 Sampling design degrees of freedom.


Hypothesis Testing


Given L and K, the following generalized linear hypothesis test is performed:

$$H_0: LB = K$$

It is assumed that LB is estimable.


Wald Chi-Square Test


$$X^2 = (L\hat{B} - K)' \left( L \hat{V}(\hat{B}) L' \right)^{-} (L\hat{B} - K)$$    Koch et al. (1975)

The statistic has an asymptotic chi-square distribution with $r_1 = \mathrm{rank}(L \hat{V}(\hat{B}) L')$ degrees of
freedom. If $r_1 < r$, $(L \hat{V}(\hat{B}) L')^{-}$ is a generalized inverse such that Wald tests are effective
for a restricted set of hypotheses $L_I B = K_I$ containing a particular subset I of independent rows
from $H_0$.


Wald F Test


$$F = \frac{\nu - r_1 + 1}{r_1\, \nu}\, X^2$$    Fellegi (1980)

This statistic has an approximate asymptotic F distribution $F(r_1, \nu - r_1 + 1)$. The statistic is
undefined if $\nu < r_1$. See Korn and Graubard (1990) for the properties of this statistic.
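As a small illustration, the conversion from the Wald chi-square statistic to the Wald F statistic can be sketched in Python. The function name `wald_f` is hypothetical; its inputs are the chi-square value, the rank r1 and the design degrees of freedom, and it returns `None` where the statistic is undefined.

```python
def wald_f(x2, r1, nu):
    """Return (F, df1, df2) for the Wald F test, or None if nu < r1."""
    if nu < r1:
        return None  # statistic is undefined
    f_stat = (nu - r1 + 1) / (r1 * nu) * x2
    return f_stat, r1, nu - r1 + 1


# Hypothetical values: X^2 = 12 with r1 = 3 and 10 design degrees of freedom.
print(wald_f(12.0, 3, 10))
```

Note that with a single hypothesis (r1 = 1) the multiplier equals 1, so F coincides with the chi-square value.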


Adjusted Wald Chi-Square Test 


The Wald chi-square statistic under the simple random sampling assumption is given by the
following expression:

$$X^2_{srs} = (L\hat{B} - K)' \left( L \hat{V}_{srs}(\hat{B}) L' \right)^{-} (L\hat{B} - K)$$

where $\hat{V}_{srs}(\hat{B})$ is an asymptotic covariance matrix estimated under the simple random sampling
assumption. If $\mathrm{rank}(L \hat{V}_{srs}(\hat{B}) L') < r$, adjusted Wald tests are effective for a restricted set of
hypotheses $L_I B = K_I$ containing a particular subset I of independent rows from $H_0$.

Since the asymptotic distribution of $X^2_{srs}$ is generally not a chi-square distribution, it is adjusted
using the following matrix:

$$A = \left( L \hat{V}_{srs}(\hat{B}) L' \right)^{-} \left( L \hat{V}(\hat{B}) L' \right)$$

where $\hat{V}(\hat{B})$ is an estimated asymptotic covariance matrix under the complex sample design. We
use the second-order adjustment of Rao and Scott (1984), given by

$$X^2_{adj} = \frac{X^2_{srs}}{\mathrm{tr}(A)/d}$$

where

$$d = \frac{(\mathrm{tr}\,A)^2}{\mathrm{tr}(A^2)}$$

This statistic has an approximate asymptotic chi-square distribution with d degrees of freedom.
See Graubard and Korn (1993) for properties of this statistic in reference to regression problems.


Adjusted Wald F Test


$$F_{adj} = \frac{X^2_{adj}}{d}$$    Rao and Scott (1984)

This statistic has an approximate asymptotic F distribution $F(d, d\nu)$ where d is defined as above.

See Thomas and Rao (1987) for the heuristic derivation of this test, and Rao and Thomas (2003)
for a review of the related simulation studies.
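The second-order adjustment can be sketched as follows, assuming the adjustment matrix A and the simple random sampling statistic have already been computed, and assuming d = (tr A)^2 / tr(A^2), which is the usual form of the Rao-Scott second-order correction. The function name and the plain nested-list matrix representation are illustrative choices only.

```python
def rao_scott_adjust(x2_srs, a):
    """Return (X2_adj, F_adj, d) from X2_srs and a square matrix A (nested lists)."""
    r = len(a)
    tr_a = sum(a[i][i] for i in range(r))
    # tr(A^2) = sum over i, j of A[i][j] * A[j][i]
    tr_a2 = sum(a[i][j] * a[j][i] for i in range(r) for j in range(r))
    d = tr_a ** 2 / tr_a2
    x2_adj = x2_srs / (tr_a / d)
    f_adj = x2_adj / d
    return x2_adj, f_adj, d


# Hypothetical case: the design doubles the variance of both contrasts.
a = [[2.0, 0.0],
     [0.0, 2.0]]
print(rao_scott_adjust(10.0, a))
```

When A is a multiple of the identity, d equals the number of hypotheses and the adjustment simply divides the statistic by the common design effect.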


Individual Tests


Each row $l'$ of the L matrix may also be tested separately. For such tests, or when the L matrix
contains a single row, the statistics above simplify as follows:

$$X^2 = \frac{(l'\hat{B} - K)^2}{l'\hat{V}(\hat{B})\,l}$$

and

$$F = X^2_{adj} = F_{adj} = X^2$$

The test statistics $X^2$ and $X^2_{adj}$ have asymptotic chi-square distributions with 1 degree of freedom.
The test statistics $F$ and $F_{adj}$ have approximate asymptotic F distributions $F(1, \nu)$. The tests are
undefined if $l'\hat{V}(\hat{B})\,l$ is not positive.
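A single-row test can be sketched with the Python standard library alone. The sketch assumes the statistic is the squared contrast divided by its estimated variance, and it obtains the chi-square p-value with 1 degree of freedom from the identity P(chi-square_1 > x) = erfc(sqrt(x/2)). The function name and inputs are hypothetical.

```python
import math


def individual_wald(l, b_hat, k, v_hat):
    """Return (X2, p) for the single-row hypothesis l'B = k, or None if undefined."""
    p_dim = len(l)
    contrast = sum(li * bi for li, bi in zip(l, b_hat)) - k
    variance = sum(l[i] * v_hat[i][j] * l[j]
                   for i in range(p_dim) for j in range(p_dim))
    if variance <= 0:
        return None  # test is undefined when l'Vl is not positive
    x2 = contrast ** 2 / variance
    p_value = math.erfc(math.sqrt(x2 / 2.0))  # chi-square, 1 degree of freedom
    return x2, p_value


# Hypothetical contrast between two parameters with independent estimates.
x2, p = individual_wald([1.0, -1.0], [2.0, 1.0], 0.0,
                        [[0.25, 0.0], [0.0, 0.25]])
print(x2, p)
```

The p-value for the F versions would instead require the F(1, nu) distribution, which is not available in the standard library and is therefore left out of this sketch.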


Significance Values 


Given a value of test statistic T and a corresponding cumulative distribution function G as
specified above, the p-value of the given test is computed as $p = 1 - G(T)$.


Multiple Comparisons 


In addition to the testing methods mentioned in the previous section, the hypothesis
$H_0: LB = K$ can also be tested using the multiple row hypotheses testing technique. Let $l_i'$ be the
ith row vector of the L matrix, and $k_i$ be the ith element of the K vector. The ith row hypothesis is
$H_{0i}: l_i'B = k_i$. Testing $H_0$ is the same as testing multiple hypotheses $\{H_{0i}\}_{i=1}^{R}$ simultaneously,
where R is the number of non-redundant row hypotheses. A hypothesis $H_{0i}$ is redundant if there
exists another hypothesis $H_{0j}$, $j \ne i$, such that $l_i = c\,l_j$, $k_i = c\,k_j$, $c \ne 0$.

For each individual hypothesis $H_{0i}$, tests described in the previous section can be performed. Let
$p_i$ denote the p-value for testing $H_{0i}$, and $p_i^*$ denote the adjusted p-value. The conclusion from
multiple testing is, at level $\alpha$ (the family-wise type I error),

reject $H_{0i}: l_i'B = k_i$ if $p_i^* < \alpha$
reject $H_0: LB = K$ if $\min_i(p_i^*) < \alpha$

There are different methods for adjusting p-values. If the adjusted p-value is greater than 1, it is
set to 1 in all the methods.

Sequential Tests. In sequential testing, the p-values are first ordered from the
smallest to the largest, and then adjusted depending on the order. Let the ordered p-values be

$$p_{(1)} \le p_{(2)} \le \cdots \le p_{(R)}$$


LSD (Least Significant Difference)

The adjusted p-values are the same as the original p-values:

$$p_i^* = p_i$$

Bonferroni

The adjusted p-values are:

$$p_i^* = R\, p_i$$

Sidak

The adjusted p-values are:

$$p_i^* = 1 - (1 - p_i)^R$$

Sequential Bonferroni

The adjusted p-values are:

$$p_{(i)}^* = \begin{cases} R\, p_{(1)} & i = 1 \\ \max\left( (R - i + 1)\, p_{(i)},\; p_{(i-1)}^* \right) & i \ge 2 \end{cases}$$

Sequential Sidak

The adjusted p-values are:

$$p_{(i)}^* = \begin{cases} 1 - (1 - p_{(1)})^R & i = 1 \\ \max\left( 1 - (1 - p_{(i)})^{R - i + 1},\; p_{(i-1)}^* \right) & i \ge 2 \end{cases}$$
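The five adjustments can be sketched in one Python function. This is an illustration, not the procedure's code: the function name and the method labels are hypothetical, and the input list is assumed to contain one p-value per non-redundant row hypothesis, so that R is simply its length.

```python
def adjust_p_values(p, method):
    """Return adjusted p-values in the original order, capped at 1."""
    r = len(p)
    if method == "lsd":
        adjusted = list(p)
    elif method == "bonferroni":
        adjusted = [r * pi for pi in p]
    elif method == "sidak":
        adjusted = [1.0 - (1.0 - pi) ** r for pi in p]
    elif method in ("seq_bonferroni", "seq_sidak"):
        order = sorted(range(r), key=lambda i: p[i])  # smallest p-value first
        adjusted = [0.0] * r
        previous = 0.0
        for rank, idx in enumerate(order, start=1):
            m = r - rank + 1  # multiplier or exponent R - i + 1
            if method == "seq_bonferroni":
                raw = m * p[idx]
            else:
                raw = 1.0 - (1.0 - p[idx]) ** m
            previous = max(raw, previous)  # keep adjusted values non-decreasing
            adjusted[idx] = previous
    else:
        raise ValueError("unknown adjustment method")
    return [min(a, 1.0) for a in adjusted]


p_values = [0.01, 0.04, 0.03]
for name in ("lsd", "bonferroni", "sidak", "seq_bonferroni", "seq_sidak"):
    print(name, adjust_p_values(p_values, name))
```

The running maximum in the sequential branches implements the second case of each sequential formula, which prevents a larger original p-value from receiving a smaller adjusted value.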


Comparison of Adjustment Methods 


A multiple testing procedure tells not only if $H_0$ is rejected, but also if each individual $H_{0i}$ is
rejected. All the methods, except LSD, control the family-wise type I error for testing $H_0$; that is,
the probability of rejecting at least one individual hypothesis under $H_0$. In addition, sequential
methods also control the family-wise type I error for testing any subset of $H_0$.

LSD applies no adjustment, so it rejects $H_0$ too often. It does not control the family-wise
type I error and should never be used to test $H_0$. It is provided here mainly for reference.

Bonferroni is conservative in the sense that it rejects $H_0$ less often than it should. In some
situations, it becomes extremely conservative when test statistics are highly correlated.

Sidak is also conservative in most cases, but is less conservative than Bonferroni. It gives the
exact type I error when test statistics are independent.

Sequential Bonferroni is as conservative as Bonferroni in terms of testing $H_0$ because the
smallest adjusted p-value used in making the decision is the same in both methods. But in terms of
testing individual $H_{0i}$, it is less conservative than Bonferroni. Sequential Bonferroni rejects at
least as many individual hypotheses as Bonferroni.

Sequential Sidak is as conservative as Sidak in terms of testing $H_0$, but less conservative
than Sidak in terms of testing individual $H_{0i}$. Sequential Sidak is less conservative than
sequential Bonferroni.


CTABLES Algorithms


This document describes the algorithms used in the Custom Tables procedure. 


Weights 
There are two ways in which weighting can be applied in CTABLES: 


1. Frequency or case weighting is specified via a WEIGHT command. Weights specified in
this manner represent frequency replication (i.e., cases with the same values for all
variables) and should be positive integers. Non-integer values are accepted and are
used as specified for descriptive statistics, but sums are generally rounded in computing
inferential statistics (standard errors, confidence intervals and test statistics).


2. Effective sample size or effective base adjustment weighting is specified via a WEIGHT 
subcommand. Weights need only be positive and no rounding is applied at any point in 
computations when using this form of adjustment weighting. 


If the WEIGHT subcommand has been specified, formulas for effective base weighting are used 
and if a WEIGHT command is also in effect, it is silently ignored. If no WEIGHT subcommand is 
specified, formulas for weighted analyses using the WEIGHT command are used. If no 
weighting is in effect, these formulas are used with all weights equal to 1. 


A note on weights and multiple response sets 


Case weights are always based on Counts, not Responses, even when one of the variables is a 
multiple response variable. 
Means and Sums 


This section describes the algorithm used in computing variances, standard errors and 
confidence intervals for the means and sums of scale variables. 


Notation 
$c_i$                Unweighted case count in the i-th category, i=1,...,k.
$\mu_i$              Population mean of the i-th category, i=1,...,k.
$x_{il}$             l-th observation in i-th category, i=1,...,k.
$w_{il}$             Weight of the l-th observation in i-th category, i=1,...,k.
$w_i$                Sum of weights in category i, i=1,...,k.
$w'_i$               Rounded sum of weights in category i, i=1,...,k.
$q_i$                Sum of squared weights in category i, i=1,...,k.
$e_i$                Effective base in category i, i=1,...,k.
$\bar{x}_i$          Weighted mean of category i, i=1,...,k.
$x_{i\cdot}$         Weighted sum of category i, i=1,...,k.
$s_i^2$              Weighted variance of category i, i=1,...,k.
$\hat{s}_i^2$        Adjusted variance of category i, i=1,...,k incorporating effective base.
$s_{\bar{x}_i}$      Estimated standard error of the mean of category i, i=1,...,k.
$s_{x_{i\cdot}}$     Estimated standard error of the sum of category i, i=1,...,k.
$(1-\alpha)$%        Confidence interval coverage level supplied by the user.


Conditions and assumptions 


User and system missing values of scale variables are excluded. 


Algorithm 


Means and Sums 


Basic weighted statistics 


Weighted sum of i-th category: $x_{i\cdot} = \sum_{l=1}^{c_i} w_{il}\, x_{il}$

Weighted mean of i-th category: $\bar{x}_i = \frac{\sum_{l=1}^{c_i} w_{il}\, x_{il}}{w_i}$

Weighted variance of i-th category: $s_i^2 = \frac{\sum_{l=1}^{c_i} w_{il} (x_{il} - \bar{x}_i)^2}{w_i - 1}$

Weighted standard deviation of the i-th category: $s_i = \sqrt{s_i^2}$


Standard errors with WEIGHT command in effect


Estimated standard error of weighted mean of i-th category: $s_{\bar{x}_i} = \frac{s_i}{\sqrt{w_i}}$

Estimated standard error of weighted sum of i-th category: $s_{x_{i\cdot}} = \sqrt{w_i}\; s_i$


Adjusted weighted statistics and standard errors with effective base weighting


Effective base of i-th category: $e_i = \frac{w_i^2}{q_i}$

Adjusted variance estimate of i-th category incorporating effective base:

$$\hat{s}_i^2 = \frac{\sum_{l=1}^{c_i} w_{il} (x_{il} - \bar{x}_i)^2}{w_i - \dfrac{w_i}{e_i}} = \frac{(w_i - 1)\, s_i^2}{w_i - \dfrac{w_i}{e_i}}$$

Estimated standard error of weighted mean of i-th category: $s_{\bar{x}_i} = \frac{\hat{s}_i}{\sqrt{e_i}}$


Estimated standard error of weighted sum of i-th category: $s_{x_{i\cdot}} = \frac{w_i\, \hat{s}_i}{\sqrt{e_i}}$

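The effective base calculations for one category can be sketched as follows. The sketch assumes the effective base is the squared sum of weights divided by the sum of squared weights, that the adjusted variance divides the weighted sum of squares by w - w/e, and that the standard error of the mean is the adjusted standard deviation divided by the square root of the effective base. The function name and the returned dictionary keys are hypothetical.

```python
import math


def effective_base_stats(x, w):
    """Weighted mean, effective base, adjusted variance and SE of the mean."""
    if len(x) != len(w) or len(x) < 2:
        raise ValueError("need at least two cases with matching weights")
    w_sum = sum(w)                            # sum of weights
    q = sum(wi * wi for wi in w)              # sum of squared weights
    e = w_sum ** 2 / q                        # effective base
    mean = sum(wi * xi for wi, xi in zip(w, x)) / w_sum
    ss = sum(wi * (xi - mean) ** 2 for wi, xi in zip(w, x))
    adj_var = ss / (w_sum - w_sum / e)        # adjusted variance
    se_mean = math.sqrt(adj_var) / math.sqrt(e)
    return {"mean": mean, "effective_base": e,
            "adjusted_variance": adj_var, "se_mean": se_mean}


# Doubling every weight leaves the effective base and the adjusted variance unchanged.
print(effective_base_stats([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0, 1.0]))
print(effective_base_stats([1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 2.0, 2.0]))
```

The two calls show the main design property of this form of weighting: results depend on the relative sizes of the weights, not on their overall scale.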


Confidence interval for weighted mean with WEIGHT command


$(1-\alpha)$% confidence interval for population mean $\mu_i$ of i-th category:

$$\bar{x}_i \pm t_{df,\alpha}\; s_{\bar{x}_i}$$

where $t_{df,\alpha}$ is the value for a Student's t distribution with df degrees of freedom
that exceeds $\alpha$% of the distribution.


Confidence interval for weighted mean with effective base weighting


$(1-\alpha)$% confidence interval for population mean $\mu_i$ of i-th category
incorporating effective base:

$$\bar{x}_i \pm t_{df,\alpha}\; s_{\bar{x}_i}$$

where $t_{df,\alpha}$ is the value for a Student's t distribution with df degrees of freedom
that exceeds $\alpha$% of the distribution.


Confidence interval for weighted sum with WEIGHT command


$(1-\alpha)$% confidence interval for population sum of i-th category:

$$x_{i\cdot} \pm t_{df,\alpha}\; s_{x_{i\cdot}}$$

where $t_{df,\alpha}$ is the value for a Student's t distribution with df degrees of freedom
that exceeds $\alpha$% of the distribution.


Confidence interval for weighted sum with effective base weighting


$(1-\alpha)$% confidence interval for population sum of i-th category:

$$x_{i\cdot} \pm t_{df,\alpha}\; s_{x_{i\cdot}}$$

where $t_{df,\alpha}$ is the value for a Student's t distribution with df degrees of freedom
that exceeds $\alpha$% of the distribution.


Counts and Percentages 


This section describes the algorithms used in computing adjusted standard errors and 
confidence intervals for counts and percentages for categorical variables. 


Notation 

$c$               Unweighted case count.
$\delta_{il}$     Indicator of whether the l-th case is in the i-th category.
$w_l$             Weight of the l-th observation.
$w_i$             Sum of weights in category i.
$W$               Sum of weights over all categories used in forming the
                  proportion/percentage denominator.
$q$               Sum of squared weights over all categories.
$E$               Effective sample size or effective base over all categories used in forming
                  the proportion/percentage denominator.
$w'_i$            Rounded sum of weights in category i.
$W'$              Rounded sum of weights over all categories used in forming the
                  proportion/percentage denominator.
$p_i$             Weighted observed proportion of category i.
$\hat{p}_i$       Estimated population proportion of category i.
$s_{\hat{p}_i}$   Estimated standard error of the proportion in category i.
$\hat{p}_i$%      Estimated population percentage in category i.
$\hat{y}_i$       Estimated population count in category i.
$(1-\alpha)$%     Confidence interval coverage level supplied by the user.


Conditions and assumptions 


• Cases with user and system missing values of scale variables are excluded.


Algorithm 


Counts and Percentages 


Basic weighted statistics


Weighted observed count in i-th category: $w_i = \sum_{l=1}^{c} w_l\, \delta_{il}$

Weighted observed proportion in i-th category: $p_i = \frac{\sum_{l=1}^{c} w_l\, \delta_{il}}{W}$


Adjusted weighted statistics


Estimated population proportion in i-th category: $\hat{p}_i = p_i$


Estimated standard error of population proportion in i-th category with WEIGHT
command in effect:

$$s_{\hat{p}_i} = \sqrt{\frac{\hat{p}_i (1 - \hat{p}_i)}{W'}}$$

Estimated standard error of population proportion in i-th category with WEIGHT
subcommand in effect:

$$s_{\hat{p}_i} = \sqrt{\frac{\hat{p}_i (1 - \hat{p}_i)}{E}}$$

Estimated standard error of weighted percentage in i-th category:

$$SE(\hat{p}_i\,\%) = 100\; s_{\hat{p}_i}$$

Estimated standard error of population count in i-th category:

$$SE(\hat{y}_i) = W'\; s_{\hat{p}_i}$$


Confidence interval for weighted population proportion


$(1-\alpha)$% confidence interval for estimated population proportion of i-th category:

If WEIGHT subcommand is not in effect:

Lower bound ($\hat{p}_i$) = IDF.BETA($\alpha/2$, $w'_i + .5$, $W' - w'_i + .5$),

Upper bound ($\hat{p}_i$) = IDF.BETA($1 - \alpha/2$, $w'_i + .5$, $W' - w'_i + .5$)

where IDF.BETA is the inverse Beta distribution function.

If WEIGHT subcommand is in effect:

Lower bound ($\hat{p}_i$) = IDF.BETA($\alpha/2$, $E\,p_i + .5$, $E(1 - p_i) + .5$),

Upper bound ($\hat{p}_i$) = IDF.BETA($1 - \alpha/2$, $E\,p_i + .5$, $E(1 - p_i) + .5$)

where IDF.BETA is the inverse Beta distribution function.


Confidence interval for weighted population percentage


Lower bound ($\hat{p}_i$%) = 100 * Lower bound ($\hat{p}_i$)

Upper bound ($\hat{p}_i$%) = 100 * Upper bound ($\hat{p}_i$)


Confidence interval for weighted population count


Lower bound ($\hat{y}_i$) = W' * Lower bound ($\hat{p}_i$)

Upper bound ($\hat{y}_i$) = W' * Upper bound ($\hat{p}_i$)


Percentiles 

This section describes the algorithms used in computing percentiles and confidence intervals for
percentiles for scale variables. The following applies to any cell or marginal of the sub-table
aside from the sub-table total. Therefore no subscripts for categories or cells are used here.
Note that the median is the 50th percentile.


Notation


$p$              Percentile specified by the user divided by 100.
$P$              Proportion of data less than or equal to the 100p-th percentile.
$W$              Sum of weights.
$W'$             Rounded sum of weights.
$q$              Sum of squared weights.
$w$              Cumulative sum of weights for cases less than or equal to the 100p-th
                 percentile.
$w'$             Rounded cumulative sum of weights for cases less than or equal to the
                 100p-th percentile.
$(1-\alpha)$%    Confidence interval coverage level supplied by the user.


Conditions and assumptions 


• Cases with user and system missing values of scale variables are excluded.


Algorithm 


Percentiles 


Percentiles 


Percentiles are computed using the averaged empirical (AEMPIRICAL) method 
documented in the statistical algorithms for the EXAMINE procedure. 


Confidence Intervals for Percentiles 


Confidence intervals for percentiles are computed in a three-step manner,
adapted from Shah & Vaish (2012) and Woodruff (1952):

Step 1) Compute the desired percentile.

Step 2) Fit a binomial confidence interval for the proportion of the data less than or
equal to the estimated percentile ($P$):

If there is no WEIGHT subcommand, compute:

$P_{lower}$ = IDF.BETA($\alpha/2$, $w' + .5$, $W' - w' + .5$)

$P_{upper}$ = IDF.BETA($1 - \alpha/2$, $w' + .5$, $W' - w' + .5$)

where IDF.BETA is the inverse Beta distribution function.

If there is a WEIGHT subcommand, compute:

$$P = \frac{w}{W}$$

$P_{lower}$ = IDF.BETA($\alpha/2$, $E\,P + .5$, $E(1 - P) + .5$)

$P_{upper}$ = IDF.BETA($1 - \alpha/2$, $E\,P + .5$, $E(1 - P) + .5$)

where E is the effective base or effective sample size, computed as the sum
of weights squared divided by the sum of squared weights:

$$E = \frac{W^2}{q}$$

and IDF.BETA is the inverse Beta distribution function.

Step 3) Apply the percentile-finding algorithm in step 1 to $P_{lower}$ and $P_{upper}$ to
obtain lower and upper interval bounds for the percentile.


References: 


Pearson's Chi-square


This section describes computation of Pearson's chi-square statistics. 


Notation 


$R$               Number of rows in the sub-table.
$C$               Number of columns in the sub-table.
$f_{ij}$          Case weights total in cell (i,j).
$r_i$             Marginal case weights total in i-th row.
$c_j$             Marginal case weights total in j-th column.
$W$               Marginal case weights total in the sub-table.
$q$               Marginal sum of squared weights in the sub-table.
$E$               Effective sample size or effective base, $E = W^2/q$.
$E_{ij}$          Expected cell counts.
$\chi^2_P$        Pearson's Chi-Square statistic.
$\chi^2_{adj}$    Pearson's Chi-Square statistic adjusted for effective base weighting.
$p_{ij}$          Population proportion for cell (i,j).
$p_{i\cdot}$      Marginal population proportion for i-th row.
$p_{\cdot j}$     Marginal population proportion for j-th column.
$df$              Degrees of freedom.
$p$               p-value of the chi-square test.
$\alpha$          Significance level supplied by the user.


Conditions and assumptions 


• Tests will not be performed on Comperimeter tables.

• Chi-square tests are performed on each innermost sub-table of each layer.

• If a scale variable is in the layer, that layer will not be used in analysis.

• The row variable and column variable must be two different categorical variables or
multiple response sets.

• The contingency table must have at least two non-empty rows and two non-empty
columns.

• Non-empty rows and columns do not include subtotals and totals.

• Empty rows and columns are assumed to be structural zeros. Therefore, R and C are
the numbers of non-empty rows and columns in the table.

• If weighting is in effect, cell statistics must include weighted cell counts or weighted
simple row/column percents; the analysis will be performed using these weighted cell
statistics. If weighting is not in effect, cell statistics must include cell counts or simple
row/column percents; the analysis will be unweighted.

• Tests are constructed by using all visible categories. Hiding of categories and showing of
user-missing categories are respected.


Algorithm 


Pearson's Chi-square


Hypothesis: $H_0: p_{ij} = p_{i\cdot}\, p_{\cdot j}$, i=1,...,R and j=1,...,C vs. not $H_0$

Let $E_{ij} = \frac{r_i\, c_j}{W}$


Statistic


Categorical variables in rows and columns

$$\chi^2_P = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(f_{ij} - E_{ij})^2}{E_{ij}}$$

Under the null hypothesis, the statistic has a Chi-square distribution with
df = (R-1)(C-1) degrees of freedom.

Categorical variable in rows and multiple response set in columns

$$\chi^2_P = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(f_{ij} - E_{ij})^2}{E_{ij} \left( 1 - \dfrac{c_j}{W} \right)}$$

Under the null hypothesis, the statistic has an approximate Chi-square
distribution with df = (R-1)C degrees of freedom.

Multiple response set in rows and categorical variable in columns

$$\chi^2_P = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(f_{ij} - E_{ij})^2}{E_{ij} \left( 1 - \dfrac{r_i}{W} \right)}$$

Under the null hypothesis, the statistic has an approximate Chi-square
distribution with df = R(C-1) degrees of freedom.

Multiple response sets in rows and columns

$$\chi^2_P = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(f_{ij} - E_{ij})^2}{E_{ij} \left( 1 - \dfrac{r_i}{W} \right) \left( 1 - \dfrac{c_j}{W} \right)}$$

Under the null hypothesis, the statistic has an approximate Chi-square
distribution with df = RC degrees of freedom.


P-value

$$p = 1 - F(\chi^2_P;\, df)$$

where F(x; df) is the cumulative distribution function of the Chi-square distribution
with df degrees of freedom.

The chi-square test is significant if $p < \alpha$.
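For the case of categorical variables in both rows and columns, the statistic and its degrees of freedom can be sketched in Python. The function name is hypothetical, the table is a nested list of cell totals, and all rows and columns are assumed to be non-empty, as the conditions above require. The p-value is omitted because the chi-square distribution function is not part of the Python standard library.

```python
def pearson_chi_square(table):
    """Return (statistic, df) for an R x C table of cell totals f_ij."""
    n_rows = len(table)
    n_cols = len(table[0])
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(n_rows)) for j in range(n_cols)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            expected = row_totals[i] * col_totals[j] / grand_total  # E_ij
            statistic += (table[i][j] - expected) ** 2 / expected
    df = (n_rows - 1) * (n_cols - 1)
    return statistic, df


# Hypothetical 2 x 2 table of weighted cell counts.
print(pearson_chi_square([[10, 20], [30, 40]]))
```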


Use of weights 


If the WEIGHT command is used, the case weights (or frequency weights) are
supposed to be integers representing the number of replications of each case. In
the chi-square test, we will only check if the aggregated cell counts $f_{ij}$ are
integers. If not, they will be rounded to the nearest integers before computations.

If the WEIGHT subcommand is used, the $f_{ij}$ are treated as effective sample size
or effective base adjustment weights and need not be integers. The Pearson
chi-square statistic is computed as indicated above without rounding aggregated cell
counts to integers, then is adjusted by

$$\chi^2_{adj} = \frac{E}{W}\, \chi^2_P = \frac{W}{q}\, \chi^2_P$$

Degrees of freedom and p for $\chi^2_{adj}$ are calculated as with $\chi^2_P$. The chi-square test is
significant if $p < \alpha$.
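A minimal sketch of the effective base adjustment follows, assuming the adjustment rescales the Pearson statistic by the ratio of the effective base to the total weight, with the effective base computed as the squared sum of weights divided by the sum of squared weights. The function name and the list of case weights are hypothetical.

```python
def effective_base_adjust(chi2_p, case_weights):
    """Rescale a Pearson chi-square statistic by E / W, where E = W^2 / q."""
    total = sum(case_weights)                   # W
    q = sum(w * w for w in case_weights)        # sum of squared weights
    effective_base = total ** 2 / q             # E
    return chi2_p * effective_base / total


# Four cases, each with weight 2: W = 8, q = 16, E = 4, so the factor is 0.5.
print(effective_base_adjust(6.0, [2.0, 2.0, 2.0, 2.0]))
```

With all weights equal to 1 the factor is 1 and the statistic is unchanged, which matches the unweighted analysis.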


Test statistics for multiple response sets 


In the formulas above we use a variation of the Pearson chi-square test
statistic developed for a combination of a categorical variable and a multiple
response set, as initially suggested by Agresti and Liu (1999). Formulas and
properties of this test can be found in a comparative study by Bilder et al.
(2000).


An extension of this approach when both variables are multiple response sets is 
given in the paper by Thomas and Decady (2004). It contains a study of the test 
properties as well as additional references. 


References 


Agresti, A. and Liu, I.-M. (1999), "Modeling responses to a categorical variable allowing
arbitrarily many category choices", Biometrics, 55, 936-943.

Bilder, C.R., Loughin, T.M. and Nettleton, D. (2000), "Multiple marginal independence testing
for pick any/c variables", Communications in Statistics: Simulation, 29, 1285-1316.

Thomas, D.R. and Decady, Y.J. (2004), "Testing for association using multiple response
survey data: approximate procedures based on Rao-Scott Approach", International Journal
of Testing, 4, 43-59.


Column Means Tests (No Effective Base Weighting) 


This section describes the algorithm used in pairwise comparisons of scale variables over levels 
of a categorical variable or a multiple response set when effective base weighting is not used. 


Notation 
$k$            Number of categories in the sub-table.
$k^*$          Number of categories with case weights greater than or equal to 2.
$\mu_i$        Population mean of the i-th category, i=1,...,k.
$x_{ij}$       j-th observation in i-th category, i=1,...,k.
$w_{ij}$       Case weight of the j-th observation in i-th category, i=1,...,k.
$w_i$          Sum of case weights in category i, i=1,...,k.
$w'_i$         Rounded sum of case weights in category i, i=1,...,k.
$\bar{x}_i$    Mean of category i, i=1,...,k.
$s_i$          Standard deviation of category i, i=1,...,k.
$s_{ij}$       Pooled standard deviation from i-th and j-th categories.
$s_p$          Pooled standard deviation from all categories.
$W$            Total case weights. Sum of rounded $w_i$'s.
$\alpha$       Significance level supplied by the user.


Conditions and assumptions 


• Tests will not be performed for Comperimeter tables.

• Tests are performed on each innermost sub-table for each layer.

• Row variable must be a scale variable, possibly nested under or over some categorical
variables or multiple response sets. Column variable must be categorical or a multiple
response set.

• If weighting is on, cell statistics must include weighted means; a weighted analysis will
be performed using the weighted statistics. If weighting is off, cell statistics must include
means; an unweighted analysis will be performed.

• Tests are constructed by using all visible, non-empty categories excluding totals and
sub-totals. Hiding of categories and showing of user-missing categories are respected.

• Total case weights in each category must be at least two. Categories not satisfying this
assumption are not used. If the number of categories satisfying this condition is less than
two, no comparisons will be made.

• Population variances of all categories are assumed to be equal.

• User and system missing values of scale variables are excluded.


Algorithm 
All Pairwise Comparisons 


Hypothesis: $H_0: \mu_i = \mu_j$ vs. $H_1: \mu_i \ne \mu_j$, for all i > j.

Total number of hypotheses: $\frac{k^*(k^* - 1)}{2}$ (where $k^* = \sum_{i=1}^{k} I(w_i \ge 2)$)


Note that this assumes that a positive variance estimate can be computed using 
the specified method (pooling over all categories or over the two categories 
compared). If the pooled variance estimate using all categories is 0, no 
comparisons will be made. If the pooled variance estimate using only two 
categories is 0, this comparison will not be made and the number of hypotheses 
tested is reduced. 


Aggregated statistics 


The statistics in pairwise comparisons are computed from aggregated category
means ($\bar{x}_i$), sample variances ($s_i^2$) and sample sizes ($w_i$), i=1,...,k. Various
quantities used in the comparisons are shown below.

Total case weight (sample size): $W = \sum_{i=1}^{k} w_i\, I(w_i \ge 2)$

Mean of i-th category: $\bar{x}_i = \frac{\sum_{j} w_{ij}\, x_{ij}}{w_i}$

Sample variance of i-th category: $s_i^2 = \frac{\sum_{j} w_{ij} (x_{ij} - \bar{x}_i)^2}{w_i - 1}$


Statistics for (i,j) comparisons with variance pooled from the two compared
categories


Assume $w_i \ge 2$ and $w_j \ge 2$.

Variance pooled from the two compared categories:

$$s_{ij}^2 = \frac{(w_i - 1)\, s_i^2 + (w_j - 1)\, s_j^2}{w_i + w_j - 2}$$

t-statistic for comparing levels of a categorical variable:

$$t_{ij} = \frac{\bar{x}_i - \bar{x}_j}{s_{ij} \sqrt{\dfrac{1}{w_i} + \dfrac{1}{w_j}}}$$

P-value $p = 2[1 - F(|t_{ij}|,\, w_i + w_j - 2)]$, where F(t, n) is the distribution function
of the t-distribution with n degrees of freedom.


When a multiple response set determines categories, there may exist cases that
belong to both the i-th and j-th category. Let $w_{ij}$ (here the overlap between the two
categories, not the weight of a single case) be the rounded sum of weights for
such cases.


t-statistic for comparing levels of a multiple response set:


P-value $p = 2[1 - F(|t_{ij}|,\, w_i + w_j - w_{ij} - 2)]$.


A comparison is significant if $p < \alpha$ when no multiple comparison adjustments are
made. For multiple comparison adjustment formulas see the final section.


Statistics for (i,j) comparisons with variance pooled from all categories


Within groups variance pooled from all the categories:

$$s_p^2 = \frac{\sum_{i=1}^{k} I(w_i \ge 2)\, (w_i - 1)\, s_i^2}{W - k^*}$$

t-statistic for levels of a categorical variable:

$$t_{ij} = \frac{\bar{x}_i - \bar{x}_j}{s_p \sqrt{\dfrac{1}{w_i} + \dfrac{1}{w_j}}}$$

Note: This pooled-variance version of the test is available only for categories
defined by a categorical variable (it is not available when categories are defined
by a multiple response variable).

P-value $p = 2[1 - F(|t_{ij}|,\, W - k^*)]$.

A comparison is significant if $p < \alpha$ when no multiple comparison adjustments are
made. For multiple comparison adjustment formulas see the final section.
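Both pooling choices can be sketched in Python from aggregated means, variances and case weight totals. The sketch assumes the standard pooled two-sample form of the t-statistic (difference in means divided by the pooled standard deviation times the square root of 1/w_i + 1/w_j), covers categorical variables only, and omits p-values because the t distribution function is not in the standard library. All function names are hypothetical.

```python
import math


def t_pooled_two(mean_i, var_i, w_i, mean_j, var_j, w_j):
    """t-statistic and df with variance pooled from the two compared categories."""
    df = w_i + w_j - 2
    pooled = ((w_i - 1) * var_i + (w_j - 1) * var_j) / df
    if pooled <= 0:
        return None  # comparison is not made when the pooled variance is 0
    t = (mean_i - mean_j) / math.sqrt(pooled * (1.0 / w_i + 1.0 / w_j))
    return t, df


def t_pooled_all(means, variances, weights, i, j):
    """t-statistic and df with variance pooled over all categories with w >= 2."""
    used = [c for c in range(len(weights)) if weights[c] >= 2]
    if i not in used or j not in used:
        return None  # categories with total weight below 2 are not compared
    total_w = sum(weights[c] for c in used)      # W
    k_star = len(used)                           # k*
    df = total_w - k_star
    pooled = sum((weights[c] - 1) * variances[c] for c in used) / df
    if pooled <= 0:
        return None
    t = (means[i] - means[j]) / math.sqrt(pooled * (1.0 / weights[i] + 1.0 / weights[j]))
    return t, df


print(t_pooled_two(5.0, 4.0, 8, 3.0, 4.0, 8))
print(t_pooled_all([5.0, 3.0, 4.0], [4.0, 4.0, 4.0], [8, 8, 8], 0, 1))
```

Pooling over all categories gives more degrees of freedom than pooling over the pair, at the cost of relying on the equal-variance assumption across every category.
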
Use of case weights 


The case weights (or frequency weights) are supposed to be integers
representing the number of replications of each case. If the sum of case weights in
any group ($w_i$, i=1,...,k) is not an integer, it will be rounded to the nearest integer
before calculations. Consequently, the total weight W will become the sum of the
rounded $w_i$'s.


Column Means Tests (With Effective Base Weighting) 


This section describes the algorithm used in pairwise comparisons of scale variables over levels 
of a categorical variable or a multiple response set when effective base weighting is used. 


Notation 
k            Number of categories in the sub-table.
k*           Number of categories with at least two unweighted cases ($c_i \ge 2$).
$c_i$        Unweighted case count in the i-th category, i=1,...,k.
$\mu_i$      Population mean of the i-th category, i=1,...,k.
$x_{il}$     l-th observation in the i-th category, i=1,...,k.
$w_{il}$     Weight of the l-th observation in the i-th category, i=1,...,k.
$w_i$        Sum of weights in category i, i=1,...,k.
$q_i$        Sum of squared weights in category i, i=1,...,k.
$e_i$        Effective base in category i, i=1,...,k.
$\bar{x}_i$  Weighted mean of category i, i=1,...,k.
$s_i$        Weighted standard deviation of category i, i=1,...,k.
$\hat{s}_i$  Adjusted weighted standard deviation of category i, i=1,...,k, incorporating
             effective base.
p            p-value of a test.


Conditions and assumptions 


• Tests will not be performed for Comperimeter tables.

• Tests are performed on each innermost sub-table for each layer.

• The row variable must be a scale variable, possibly nested under or over some
categorical variables or multiple response sets. The column variable must be categorical
or a multiple response set.

• Cell statistics must include weighted means.

• Tests are constructed by using all visible, non-empty categories excluding totals and
sub-totals. Hiding of categories and showing of user-missing categories are respected.

• In order for two categories to be compared, each must have at least one valid case and
at least one of the two categories must have at least two valid cases. If no categories
have more than one valid case, no comparisons will be made.

• Population variances of all categories are assumed to be equal.

• User and system missing values of scale variables are excluded.


Algorithm 
All Pairwise Comparisons 


Hypothesis: $H_{0ij}: \mu_i = \mu_j$ vs. $H_{1ij}: \mu_i \ne \mu_j$ for all $j > i$.


Total number of hypotheses tested: 


$$\frac{k^*(k^* - 1)}{2} + k^*(k - k^*),$$

where $k^* = \sum_{i=1}^{k} I(c_i > 1)$ if the default pooled population
variance estimate is used and assuming that this pooled variance estimate is
positive. If the population variance estimate is based on only the two categories
used in the comparison, $k^* = \sum_{i=1}^{k} I(c_i \ge 2)$.

Note that if all categories have at least two valid cases and a positive variance
estimate can be computed, this reduces to the more familiar $k(k-1)/2$.


Aggregated statistics 


The statistics in pairwise comparisons are computed from aggregated category
means ($\bar{x}_i$), sample variances ($s_i^2$) and sample sizes ($w_i$), i=1,...,k. Various
quantities used in the comparisons are shown below.


Weighted mean of i-th category: $\bar{x}_i = \frac{1}{w_i} \sum_{l=1}^{c_i} w_{il} x_{il}$


Weighted variance of i-th category: $s_i^2 = \frac{1}{w_i - 1} \sum_{l=1}^{c_i} w_{il} (x_{il} - \bar{x}_i)^2$


Effective base of i-th category: $e_i = \frac{w_i^2}{q_i}$


Adjusted variance estimate of i-th category incorporating effective base:


$$\hat{s}_i^2 = \frac{e_i}{e_i - 1} \cdot \frac{\sum_{l=1}^{c_i} w_{il} (x_{il} - \bar{x}_i)^2}{w_i} = \frac{\sum_{l=1}^{c_i} w_{il} (x_{il} - \bar{x}_i)^2}{w_i - \frac{w_i}{e_i}} = \frac{(w_i - 1) s_i^2}{w_i - \frac{w_i}{e_i}}$$
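
As a rough illustration of the aggregated quantities above, the following Python sketch computes the effective base and both variance estimates for one category. The function name and the returned dictionary keys are hypothetical, and the adjusted variance is taken to be the weighted sum of squares divided by $w_i - w_i/e_i$, which is an assumption about the intended form.

```python
def effective_base_stats(values, weights):
    """Weighted mean, variance, effective base and adjusted variance of one category."""
    w = sum(weights)                      # w_i: sum of weights
    q = sum(wt * wt for wt in weights)    # q_i: sum of squared weights
    e = w * w / q                         # e_i: effective base
    mean = sum(wt * x for x, wt in zip(values, weights)) / w
    ss = sum(wt * (x - mean) ** 2 for x, wt in zip(values, weights))
    return {
        "w": w,
        "e": e,
        "mean": mean,
        "var": ss / (w - 1),              # weighted variance
        "var_adj": ss / (w - w / e),      # adjusted variance (assumed form)
    }

stats = effective_base_stats([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

With equal weights the effective base equals the unweighted case count, and the adjusted variance coincides with the ordinary unweighted sample variance, which is a convenient sanity check.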


Statistics for (i,j) comparisons with variance pooled from the two compared 
categories 


Pooled variance estimate from the two compared categories: 


$$\hat{s}_{ij}^2 = \frac{(w_i - 1) s_i^2 + (w_j - 1) s_j^2}{\left(w_i - \frac{w_i}{e_i}\right) + \left(w_j - \frac{w_j}{e_j}\right)}$$


t-statistic for comparing levels of a categorical variable:


p-value: $p = 2[1 - F(|t_{ij}|, e_i + e_j - 2)]$, where $F(t, n)$ is the distribution function of a
t-distribution with n degrees of freedom.


A comparison is significant if $p < \alpha$ when no multiple comparison adjustments are
made. For multiple comparison adjustment formulas, see the final section.


When a multiple response set determines categories, there may exist cases that
belong to both the i-th and j-th categories. Let $w_{ij}$ be the sum of weights for
such cases. The effective base of such cases is then
$e_{ij} = 0.5 \, w_{ij} \left( \frac{e_i}{w_i} + \frac{e_j}{w_j} \right)$.


t-statistic for comparing levels of a multiple response set: 


p-value: $p = 2[1 - F(|t_{ij}|, e_i + e_j - e_{ij} - 2)]$.


A comparison is significant if $p < \alpha$ when no multiple comparison adjustments are
made. For multiple comparison adjustment formulas, see the final section.


Statistics for (i,j) comparisons with variance pooled from all categories


Within groups variance estimate pooled from all the categories: 


$$\hat{s}_w^2 = \frac{\sum_{i=1}^{k} I(c_i \ge 2)(w_i - 1) s_i^2}{\sum_{i=1}^{k} I(c_i \ge 2)\left(w_i - \frac{w_i}{e_i}\right)}$$



t-statistic for levels of a categorical variable: 


P-value: $p = 2[1 - F(|t_{ij}|, \sum_{i=1}^{k} e_i - k^*)]$.


A comparison is significant if $p < \alpha$ when no multiple comparison adjustments are
made. For multiple comparison adjustment formulas, see the final section.


Note: This pooled-variance version of the test is available only for categories 
defined by a categorical variable (it is not available when categories are defined 
by a multiple response variable). 


Column Proportions Tests (No Effective Base Weighting) 


This section describes the algorithm used in computation of column proportions tests
when effective base weighting is not in use.


Notation 
R                   Number of rows in the sub-table.
C                   Number of columns in the sub-table.
$A_i$               i-th category of the row variable.
$B_j$               j-th category of the column variable.
$f_{ij}$            Case weights total in cell (i,j).
$c_j$               Marginal case weights total in j-th column.
$\tilde{c}_j$       Rounded marginal case weights total in j-th column.
z                   z-statistic.
$\chi^2$            Chi-Square statistic.
$p_{ij}$            Column proportion for cell (i,j).
$\hat{p}_{ij}$      Estimated column proportion for cell (i,j).
$\hat{p}_{i,jk}$    Estimate of pooled column proportion of j-th and k-th column in i-th row.
p                   p-value of a test.
$\alpha$            The significance level supplied by the user.


Conditions and assumptions 


• Tests will not be performed on Comperimeter tables and tables with scale variables in
the layer.

• Pairwise tests are performed on each row of all eligible innermost sub-tables within each
layer.

• Sub-tables must have categorical variables or multiple response sets in both rows and
columns.

• Number of rows and columns must be larger than or equal to two, i.e. $R \ge 2$ and $C \ge 2$.

• Tests are constructed by using all visible categories excluding totals and sub-totals.
Hiding of categories and showing of user-missing categories are respected.

• If weighting is on, cell statistics must include weighted cell counts or weighted simple
column percents; a weighted analysis will be performed. If weighting is off, cell statistics
requested must include cell counts or simple column percents; an unweighted analysis
will be performed.

• A proportion will be discarded if the proportion is equal to zero or one, or the sum of
case weights in a column is less than 2 (i.e. $c_j < 2$). If fewer than two proportions are left
after discarding proportions, the test will not be performed.


Algorithm 


All Pairwise Comparisons 


Table layout:


          $B_1$      $B_2$      ...   $B_C$

$A_1$     $p_{11}$   $p_{12}$   ...   $p_{1C}$

$A_2$     $p_{21}$   $p_{22}$   ...   $p_{2C}$

...

$A_R$     $p_{R1}$   $p_{R2}$   ...   $p_{RC}$


Hypothesis:


Without loss of generality, we will only look at the i-th row of the table. Let $C^*$ be
the number of categories in the i-th row where the proportion is greater than zero
and less than one, and where the sum of case weights in the corresponding
column is at least 2. In the i-th row, $C^*(C^* - 1)/2$ comparisons will be made
among $p_{i1}, p_{i2}, \dots, p_{iC}$. The (j,k)-th hypothesis will be

$H_{0jk}: p_{ij} = p_{ik}$ vs. $H_{1jk}: p_{ij} \ne p_{ik}$.


Aggregated statistics 


Column proportions tests are based on the aggregated proportions ($p_{ij}$) and cell
counts for each column ($c_j$). Column proportions are computed using the unrounded
cell counts, $p_{ij} = f_{ij} / c_j$, which are equal to the proportions actually displayed
in the table output.


Statistics for the (j,k) comparisons 


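Let $\tilde{c}_j = \mathrm{round}(c_j)$ and $\tilde{c}_k = \mathrm{round}(c_k)$.


Pooled proportion:


$$\hat{p}_{i,jk} = \frac{\tilde{c}_j p_{ij} + \tilde{c}_k p_{ik}}{\tilde{c}_j + \tilde{c}_k}$$


z statistic with a categorical variable in the columns:


$$z = \frac{p_{ij} - p_{ik}}{\sqrt{\hat{p}_{i,jk}(1 - \hat{p}_{i,jk})\left(\frac{1}{\tilde{c}_j} + \frac{1}{\tilde{c}_k}\right)}}$$


When a multiple response set defines the columns, there may exist cases that belong
to both the j-th and k-th columns. Let $\tilde{c}_{jk}$ be the rounded marginal weights total for
such cases.


z statistic with a multiple response set in the columns:


$$z = \frac{p_{ij} - p_{ik}}{\sqrt{\hat{p}_{i,jk}(1 - \hat{p}_{i,jk})\left(\frac{1}{\tilde{c}_j} + \frac{1}{\tilde{c}_k} - \frac{2\tilde{c}_{jk}}{\tilde{c}_j \tilde{c}_k}\right)}}$$


p-value: $p = 2[1 - \Phi(|z|)]$, where $\Phi(z)$ is the CDF of the standard normal
distribution.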


A comparison is significant if $p < \alpha$ when no multiple comparison adjustments are
made. For multiple comparison adjustment formulas, see the final section.
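
A minimal Python sketch of the z test for a categorical column variable follows. The function names are hypothetical, and the pooled proportion is assumed to be the count-weighted average of the two column proportions, using the rounded column totals. Note that Python's built-in round() rounds exact halves to the nearest even integer, which may differ from the rounding rule the procedure applies.

```python
import math

def normal_cdf(z):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def column_proportions_z(f_ij, c_j, f_ik, c_k):
    """z statistic and two-sided p-value comparing columns j and k in row i."""
    p_ij = f_ij / c_j                 # proportions use the unrounded totals
    p_ik = f_ik / c_k
    cj, ck = round(c_j), round(c_k)   # rounded column totals
    # Assumed pooled estimate: count-weighted average of the two proportions.
    pooled = (cj * p_ij + ck * p_ik) / (cj + ck)
    z = (p_ij - p_ik) / math.sqrt(pooled * (1 - pooled) * (1 / cj + 1 / ck))
    return z, 2 * (1 - normal_cdf(abs(z)))

z, p = column_proportions_z(30, 100, 50, 100)
```

For this input, the square of z equals 25/3, the Pearson chi-square statistic of the 2 x 2 table with cells 30, 70, 50, 50.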


Relationship to Pearson's chi-square tests


With a categorical variable in the columns, the statistics used in column
proportion tests are equivalent to the Pearson's chi-square test on a 2 x
2 table by taking the j and k-th columns and collapsing all rows except
the i-th row. Therefore performing column proportion tests on a 2 x 2
table will give the same result as Pearson's chi-square test.


Use of case weights


The case weights (or frequency weights) are supposed to be integers
representing the number of replications of each case. In column proportions
tests, we will only check if the column marginal $c_j$'s are integers. If not, they will
be rounded to the nearest integers.


Column Proportions Tests (With Effective Base Weighting)


This section describes the algorithms used in computation of column proportions tests when
effective sample size or effective base weighting is used.


Notation 
R                   Number of rows in the sub-table.
C                   Number of columns in the sub-table.
$n_j$               Number of valid unweighted cases in the j-th column of the sub-table.
$C'$                Number of categories in the i-th row where the number of valid cases in
                    the corresponding column is at least 2 ($n_j \ge 2$).
$A_i$               i-th category of the row variable.
$B_j$               j-th category of the column variable.
$w_{ij}$            Sum of weights in cell (i,j).
$w_j$               Marginal sum of weights in j-th column.
$q_{ij}$            Sum of squared weights in cell (i,j).
$e_{ij}$            Effective base in cell (i,j).
$e_j$               Effective base in j-th column.
t                   t-statistic.
$p_{ij}$            Column proportion for cell (i,j).
$\hat{p}_{ij}$      Estimated column proportion for cell (i,j).
$\hat{p}_{i,jk}$    Estimate of pooled proportion of j-th and k-th columns in the i-th row.
p                   p-value of a test.


Conditions and assumptions 


• Tests will not be performed on Comperimeter tables and tables with scale variables in
the layer.

• Pairwise tests are performed on each row of all eligible innermost sub-tables within each
layer.

• Sub-tables must have categorical variables or multiple response sets in both rows and
columns.

• The number of rows and columns must be larger than or equal to two, i.e. $R \ge 2$ and
$C \ge 2$.

• Tests are constructed by using all visible categories excluding totals and sub-totals.
Hiding of categories and showing of user-missing categories are respected.

• Cell statistics must include weighted cell counts or weighted simple column percents.

• In order for two categories to be compared, each must have at least one valid case and
at least one of the two categories must have at least two valid cases. If no categories
have more than one valid case, no comparisons will be made.


Algorithm 


All Pairwise Comparisons 


Table layout:

          $C_1$      $C_2$      ...   $C_C$
$R_1$     $p_{11}$   $p_{12}$   ...   $p_{1C}$
$R_2$     $p_{21}$   $p_{22}$   ...   $p_{2C}$
...
$R_R$     $p_{R1}$   $p_{R2}$   ...   $p_{RC}$


Hypothesis:


Without loss of generality, we will only look at the i-th row of the table. The (j,k)-th
hypothesis will be


$H_{0jk}: p_{ij} = p_{ik}$ vs. $H_{1jk}: p_{ij} \ne p_{ik}$ for all $k > j$.


Let $C'$ be the number of categories in the i-th row where the number of valid
unweighted cases is at least two.


Total number of hypotheses tested:


$$\frac{C'(C' - 1)}{2} + C'(C - C'),$$ where $C' = \sum_{j=1}^{C} I(n_j \ge 2)$.


Note that if all categories have at least two valid cases this reduces to $C(C-1)/2$.


Aggregated statistics 


Column proportions tests are based on the aggregated proportions ($p_{ij}$) and cell
counts for each column ($w_j$). Column proportions are computed as $p_{ij} = w_{ij} / w_j$.


Statistics for the (j,k) comparisons 


Pooled proportion estimate:


$$\hat{p}_{i,jk} = \frac{w_j p_{ij} + w_k p_{ik}}{w_j + w_k}$$


Let the effective base for cell (i,j) be $e_{ij} = \frac{(w_{ij})^2}{q_{ij}}$ and the effective base for
column j be $e_j = \sum_{i=1}^{R} e_{ij}$. Similarly, let the effective base for cell (i,k) be
$e_{ik} = \frac{(w_{ik})^2}{q_{ik}}$ and the effective base for column k be $e_k = \sum_{i=1}^{R} e_{ik}$. (When a
multiple response set defines the rows, effective base computations over rows
within a column are not additive. In this case the effective base for column k is
computed as $e_k = \frac{(w_k)^2}{q_k}$.)


t statistic with a categorical variable in the columns:


$$t = \frac{p_{ij} - p_{ik}}{\sqrt{\hat{p}_{i,jk}(1 - \hat{p}_{i,jk})\left(\frac{1}{e_j} + \frac{1}{e_k}\right)}}$$


p-value: $p = 2[1 - F(|t|, df)]$, where $F(t, df)$ is the CDF of Student's t
distribution with df degrees of freedom. Here $df = e_j + e_k - 2$.


A comparison is significant if $p < \alpha$ when no multiple comparison adjustments are
made. For multiple comparison adjustment formulas, see the final section.
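
The effective base version of the test can be sketched as follows, assuming a categorical row variable so that column effective bases are sums of cell effective bases. The function names and the input layout (one pair of sum of weights and sum of squared weights per row) are hypothetical choices for illustration.

```python
import math

def effective_base(sum_w, sum_w_sq):
    """Effective base of a cell: squared sum of weights over sum of squared weights."""
    return sum_w ** 2 / sum_w_sq

def column_proportions_t(col_j, col_k, row):
    """t statistic and df for row `row`; col_j and col_k hold one (w, q) pair per row."""
    w_j = sum(w for w, _ in col_j)
    w_k = sum(w for w, _ in col_k)
    # Column effective bases are additive over rows (categorical row variable assumed).
    e_j = sum(effective_base(w, q) for w, q in col_j if q > 0)
    e_k = sum(effective_base(w, q) for w, q in col_k if q > 0)
    p_ij = col_j[row][0] / w_j
    p_ik = col_k[row][0] / w_k
    pooled = (w_j * p_ij + w_k * p_ik) / (w_j + w_k)
    t = (p_ij - p_ik) / math.sqrt(pooled * (1 - pooled) * (1 / e_j + 1 / e_k))
    return t, e_j + e_k - 2

t, df = column_proportions_t([(30.0, 30.0), (70.0, 70.0)],
                             [(50.0, 50.0), (50.0, 50.0)], row=0)
```

With unit weights each effective base equals the cell count, so the statistic matches the unweighted z statistic and only the reference distribution changes.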


When a multiple response set defines the columns, there may exist cases that
belong to both the j-th and k-th columns (these cases are said to overlap). Let
$w_{i,jk}$ be the sum of the weights for the cases in row i that belong to both columns
j and k. Then the effective base of these overlapping cases for columns j and k
in row i is $e_{i,jk} = 0.5 \, w_{i,jk}\left(\frac{e_{ij}}{w_{ij}} + \frac{e_{ik}}{w_{ik}}\right)$, and the effective base of these overlapping
cases for columns j and k is $e_{jk} = \sum_{i=1}^{R} e_{i,jk}$.


t statistic with a multiple response set in the columns:


$$t = \frac{p_{ij} - p_{ik}}{\sqrt{\hat{p}_{i,jk}(1 - \hat{p}_{i,jk})\left(\frac{1}{e_j} + \frac{1}{e_k} - \frac{2 e_{jk}}{e_j e_k}\right)}}$$


p-value: $p = 2[1 - F(|t|, df)]$, where $F(t, df)$ is the CDF of Student's t
distribution with df degrees of freedom. Here $df = e_j + e_k - e_{jk} - 2$.


A comparison is significant if $p < \alpha$ when no multiple comparison adjustments are
made. For multiple comparison adjustment formulas, see the final section.


If rows are also defined by a multiple response set, then overlapping cases
cannot be identified individually by rows, so the effective base of the overlapping
cases for columns j and k is $e_{jk} = 0.5 \, w_{jk}\left(\frac{e_j}{w_j} + \frac{e_k}{w_k}\right)$, where $w_{jk}$ is the sum of
the weights of overlapping cases in columns j and k.


Multiple Comparison Adjustments for Column Means and 
Column Proportions Tests 


This section describes the algorithms used in adjusting p-values or significance levels for
pairwise comparisons among column means or proportions.


Notation
m            Number of distinct comparisons performed.
p            Unadjusted p-value of a test.
$p_B$        Bonferroni corrected p-value.
$p_{BH,i}$   Benjamini-Hochberg adjusted p-value for the i-th comparison.
$\alpha$     The significance level supplied by the user.


Algorithm 


Multiple Comparison Adjustments 


Unadjusted comparisons 


A comparison is significant if $p < \alpha$.


Bonferroni adjustment 


If the Bonferroni adjustment for multiple comparisons is requested, the p-value 
p will be adjusted by 


$$p_B = \min(mp, 1).$$


A comparison is significant if $p_B < \alpha$.


Benjamini-Hochberg False Discovery Rate Procedure 


If the Benjamini-Hochberg adjustment for multiple comparisons is requested, the 
method from Benjamini & Hochberg (1995) for controlling the false discovery 
rate (FDR) is used. 


Statistically significant comparisons

Sort the unadjusted p-values in ascending order, $p_1 \le p_2 \le \dots \le p_m$.
Find the largest unadjusted p-value $p_l$ for which


$$p_l < \frac{l}{m} \alpha.$$


Then all comparisons associated with $p_1, \dots, p_l$ are declared significant.

Adjusted p-values


The adjusted p-value $p_{BH,i}$ for the i-th comparison is computed as:


$$p_{BH,i} = p_i \quad \text{if } i = m$$

$$p_{BH,i} = \min\left(p_{BH,i+1}, \frac{m}{i} p_i\right) \quad \text{if } i < m$$
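
Both adjustments are short enough to sketch directly. The Python below is an illustration only: the function names are hypothetical, and the Benjamini-Hochberg routine returns the adjusted p-values in the original input order by walking the sorted p-values from the largest to the smallest.

```python
def bonferroni(pvalues):
    """Bonferroni adjusted p-values: min(m * p, 1)."""
    m = len(pvalues)
    return [min(m * p, 1.0) for p in pvalues]

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda idx: pvalues[idx])  # ascending ranks
    adjusted = [0.0] * m
    previous = None
    for rank in range(m, 0, -1):          # rank i runs from m down to 1
        idx = order[rank - 1]
        if rank == m:
            value = pvalues[idx]          # largest p-value is left unchanged
        else:
            value = min(previous, m / rank * pvalues[idx])
        adjusted[idx] = value
        previous = value
    return adjusted

raw = [0.01, 0.04, 0.03, 0.005]
bonf = bonferroni(raw)
bh = benjamini_hochberg(raw)
```

A comparison would then be flagged by comparing its adjusted p-value with the user-supplied significance level.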


Reference: 


Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: a 
practical and powerful approach to multiple testing. Journal of the Royal Statistical 
Society B, 57(1), 289-300. 


CURVEFIT Algorithms 


Eleven models can be selected to fit time series and produce forecasts, forecast errors, and
confidence limits. In all of the models, the observed series is some function of time.


Notation 
The following notation is used throughout this chapter unless otherwise stated:

Table 34-1
Notation

Notation      Description
$Y_t$         Observed series; t = 1,...,n
$E(Y_t)$      Expected value of $Y_t$
$\hat{Y}_t$   Predicted value for $Y_t$

Models 


CURVEFIT allows the user to specify a model with or without a constant term designated by $\beta_0$.
If this constant term is excluded, simply set it to zero or one depending upon whether it appears in an
additive or multiplicative manner in the models listed below.


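Model                Description

(1) Linear           $E(Y_t) = \beta_0 + \beta_1 t$
(2) Logarithmic      $E(Y_t) = \beta_0 + \beta_1 \ln(t)$
(3) Inverse          $E(Y_t) = \beta_0 + \beta_1 / t$
(4) Quadratic        $E(Y_t) = \beta_0 + \beta_1 t + \beta_2 t^2$
(5) Cubic            $E(Y_t) = \beta_0 + \beta_1 t + \beta_2 t^2 + \beta_3 t^3$
(6) Compound         $E(Y_t) = \beta_0 \beta_1^t$
(7) Power            $E(Y_t) = \beta_0 t^{\beta_1}$
(8) S                $E(Y_t) = \exp(\beta_0 + \beta_1 / t)$
(9) Growth           $E(Y_t) = \exp(\beta_0 + \beta_1 t)$
(10) Exponential     $E(Y_t) = \beta_0 e^{\beta_1 t}$
(11) Logistic        $E(Y_t) = (1/u + \beta_0 \beta_1^t)^{-1}$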


Assumption 


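We assume that nonlinear models (6) to (11) can be expressed in linear model form by logarithmic
transformation. So, for models (6) to (10),


$$\ln(Y_t) = \ln(E(Y_t)) + \varepsilon_t$$


and for model (11),


$$\ln\left(\frac{1}{Y_t} - \frac{1}{u}\right) = \ln\left(\frac{1}{E(Y_t)} - \frac{1}{u}\right) + \varepsilon_t$$


with $\varepsilon_t$, t = 1,...,n, being independently and identically distributed $(0, \sigma^2)$.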


Application of Regression


Each of the models is expressed in linear form and computational techniques described in the
REGRESSION procedure are applied. The dependent variable and independent variables for each
model are listed as follows:


Model    Dependent Variable     Independent Variables    Coefficients
(1)      $Y$                    $t$                      $\beta_0, \beta_1$
(2)      $Y$                    $\ln(t)$                 $\beta_0, \beta_1$
(3)      $Y$                    $1/t$                    $\beta_0, \beta_1$
(4)      $Y$                    $t, t^2$                 $\beta_0, \beta_1, \beta_2$
(5)      $Y$                    $t, t^2, t^3$            $\beta_0, \beta_1, \beta_2, \beta_3$
(6)      $\ln(Y)$               $t$                      $\beta_0^*, \beta_1^*$
(7)      $\ln(Y)$               $\ln(t)$                 $\beta_0^*, \beta_1$
(8)      $\ln(Y)$               $1/t$                    $\beta_0, \beta_1$
(9)      $\ln(Y)$               $t$                      $\beta_0, \beta_1$
(10)     $\ln(Y)$               $t$                      $\beta_0^*, \beta_1$
(11)     $\ln(1/Y - 1/u)$       $t$                      $\beta_0^*, \beta_1^*$


where $\beta_0^* = \ln(\beta_0)$ and $\beta_1^* = \ln(\beta_1)$.

The ANOVA table, coefficient estimates and their standard errors, t-values, and significance
levels are computed as in the REGRESSION procedure. Note that for the nonlinear models
(6) to (11), we have


$$SE(\hat{\beta}_0) = \exp(\hat{\beta}_0^*) \times SE(\hat{\beta}_0^*)$$


and


$$SE(\hat{\beta}_1) = \exp(\hat{\beta}_1^*) \times SE(\hat{\beta}_1^*)$$
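
To make the transformation concrete, here is a minimal Python sketch that fits the compound model (6) by regressing ln(Y) on t and back-transforming the coefficients. The helper names are hypothetical and the least squares step is a plain two-parameter fit, not the REGRESSION procedure itself.

```python
import math

def simple_ols(x, y):
    """Intercept and slope of a least squares line through (x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def fit_compound(series):
    """Fit E(Y_t) = b0 * b1**t for t = 1..n via the log-linear form."""
    t = list(range(1, len(series) + 1))
    b0_star, b1_star = simple_ols(t, [math.log(y) for y in series])
    # Back-transform: b0 = exp(b0*), b1 = exp(b1*).
    return math.exp(b0_star), math.exp(b1_star)

series = [2.0 * 1.5 ** t for t in range(1, 6)]
b0, b1 = fit_compound(series)
```

The same pattern applies to the other transformed models; only the dependent and independent variables passed to the linear fit change.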


Predicted Values and Confidence Intervals


The regression coefficients for models (1) to (5) are used to obtain the predicted values. For 
the transformed models, more computations are required to obtain the predicted values for the 
original models. The formulas are listed below: 


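Model    Description

(1)      $\hat{Y}_t = \hat{\beta}_0 + \hat{\beta}_1 t$
(2)      $\hat{Y}_t = \hat{\beta}_0 + \hat{\beta}_1 \ln(t)$
(3)      $\hat{Y}_t = \hat{\beta}_0 + \hat{\beta}_1 / t$
(4)      $\hat{Y}_t = \hat{\beta}_0 + \hat{\beta}_1 t + \hat{\beta}_2 t^2$
(5)      $\hat{Y}_t = \hat{\beta}_0 + \hat{\beta}_1 t + \hat{\beta}_2 t^2 + \hat{\beta}_3 t^3$
(6)      $\hat{Y}_t^* = \hat{\beta}_0^* + \hat{\beta}_1^* t$
(7)      $\hat{Y}_t^* = \hat{\beta}_0^* + \hat{\beta}_1 \ln(t)$
(8)      $\hat{Y}_t^* = \hat{\beta}_0 + \hat{\beta}_1 / t$
(9)      $\hat{Y}_t^* = \hat{\beta}_0 + \hat{\beta}_1 t$
(10)     $\hat{Y}_t^* = \hat{\beta}_0^* + \hat{\beta}_1 t$
(11)     $\hat{Y}_t^* = \hat{\beta}_0^* + \hat{\beta}_1^* t$


where $\hat{Y}_t^* = \ln(\hat{Y}_t)$ in models (6) to (10), and $\hat{Y}_t^* = \ln(1/\hat{Y}_t - 1/u)$ in model (11).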
The 95% prediction interval for an observation at time t is constructed as follows:


For models (1) to (5):


$\hat{Y}_t \pm t_{0.025} \sqrt{MSE(1 + h_t + 1/n)}$ if constant term is included


$\hat{Y}_t \pm t_{0.025} \sqrt{MSE(1 + h_t)}$ otherwise


For models (6) to (10):


$$\exp\left(\hat{Y}_t^* \pm t_{0.025} \sqrt{MSE(1 + h_t + 1/n)}\right)$$


and for model (11):


$$\frac{1}{\exp\left(\hat{Y}_t^* \pm t_{0.025} \sqrt{MSE(1 + h_t + 1/n)}\right) + \frac{1}{u}}$$

where MSE is the mean square error obtained by fitting the linear model, $t_{0.025}$ is the 97.5
percentage point from Student's t-distribution with the degrees of freedom associated with MSE,
and $h_t$ is the leverage (computational detail in the REGRESSION procedure).


DESCRIPTIVES Algorithms 


DESCRIPTIVES computes univariate statistics—including the mean, standard deviation, 
minimum, and maximum—for numeric variables. 


Notation 


The following notation is used throughout this section unless otherwise stated:


Table 35-1
Notation

Notation       Description
$X_i$          Value of the variable for case i
$w_i$          Weight for case i
N              Number of cases
$W_i$          Sum of the weights for the first i cases
$\bar{X}_i$    Mean for the first i cases

Moments 


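Moments about the mean are calculated recursively using a provisional means algorithm (Spicer,
1972):


$$W_j = W_{j-1} + w_j$$

$$v_j = \frac{w_j}{W_j}\left(X_j - \bar{X}_{j-1}\right)$$

$$M_j^4 = M_{j-1}^4 - 4 v_j M_{j-1}^3 + 6 v_j^2 M_{j-1}^2 + \frac{W_j^2 - 3 w_j W_{j-1}}{w_j^3} v_j^4 W_{j-1} W_j$$

$$M_j^3 = M_{j-1}^3 - 3 v_j M_{j-1}^2 + \frac{W_j W_{j-1}}{w_j^2} (W_j - 2 w_j) v_j^3$$

$$M_j^2 = M_{j-1}^2 + \frac{W_j W_{j-1}}{w_j} v_j^2$$

$$\bar{X}_j = \bar{X}_{j-1} + v_j$$

with initial values

$$W_0 = \bar{X}_0 = M_0^2 = M_0^3 = M_0^4 = 0$$

Here the superscript on M denotes the order of the moment, not a power.


After the last observation has been processed,


$W_N$ = sum of weights for all cases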
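
The recursion translates almost line by line into code. The sketch below is illustrative only: the function name is hypothetical, all weights are assumed to be strictly positive, and the update order matters because each higher moment is updated from the previous values of the lower ones.

```python
def provisional_moments(values, weights):
    """One-pass weighted mean and sums of 2nd, 3rd and 4th powers of deviations."""
    W = mean = m2 = m3 = m4 = 0.0
    for x, w in zip(values, weights):
        W_prev = W
        W = W_prev + w
        v = (w / W) * (x - mean)
        # Update the highest moment first so the lower ones still hold old values.
        m4 += (-4 * v * m3 + 6 * v * v * m2
               + (W * W - 3 * w * W_prev) / w ** 3 * v ** 4 * W_prev * W)
        m3 += -3 * v * m2 + (W * W_prev / (w * w)) * (W - 2 * w) * v ** 3
        m2 += (W * W_prev / w) * v * v
        mean += v
    return W, mean, m2, m3, m4

W, mean, m2, m3, m4 = provisional_moments([1, 2, 3, 4, 10], [1, 1, 1, 1, 1])
```

For this input the deviations from the mean of 4 are -3, -2, -1, 0 and 6, so the three sums can be checked by hand.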


Basic Statistics


Mean

$$\bar{X}_N$$

Variance

$$S^2 = \frac{M_N^2}{W_N - 1}$$

Standard Deviation

$$S = \sqrt{S^2}$$

Standard Error

$$S_{\bar{X}} = \frac{S}{\sqrt{W_N}}$$

Minimum

$$\min_j X_j$$

Maximum

$$\max_j X_j$$

Sum

$$\bar{X}_N W_N$$



Skewness and Standard Error of Skewness 


$$g_1 = \frac{W_N M_N^3}{(W_N - 1)(W_N - 2) S^3}, \qquad se(g_1) = \sqrt{\frac{6 W_N (W_N - 1)}{(W_N - 2)(W_N + 1)(W_N + 3)}}$$


If $W_N \le 2$ or $S^2 < 10^{-20}$, $g_1$ and its standard error are not calculated.


Kurtosis (Bliss, 1967, p. 144) and Standard Error of Kurtosis 


$$g_2 = \frac{W_N (W_N + 1) M_N^4 - 3 M_N^2 M_N^2 (W_N - 1)}{(W_N - 1)(W_N - 2)(W_N - 3) S^4}, \qquad se(g_2) = \sqrt{\frac{4 (W_N^2 - 1)(se(g_1))^2}{(W_N - 3)(W_N + 5)}}$$


If $W_N \le 3$ or $S^2 < 10^{-20}$, $g_2$ and its standard error are not calculated.
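
Given the total weight and the second, third and fourth moment sums, skewness and kurtosis follow directly. The Python below is a sketch under assumptions: the function name, the dictionary keys and the `tol` parameter are hypothetical, and the 1e-20 variance threshold reflects the value as read from the text.

```python
import math

def skewness_kurtosis(W, m2, m3, m4, tol=1e-20):
    """Skewness, kurtosis and their standard errors from moment sums."""
    result = {"g1": None, "se_g1": None, "g2": None, "se_g2": None}
    if W <= 2:
        return result                     # skewness is not calculated
    s2 = m2 / (W - 1)
    if s2 < tol:
        return result                     # variance too small to be usable
    s = math.sqrt(s2)
    result["g1"] = W * m3 / ((W - 1) * (W - 2) * s ** 3)
    result["se_g1"] = math.sqrt(6 * W * (W - 1) / ((W - 2) * (W + 1) * (W + 3)))
    if W > 3:
        numerator = W * (W + 1) * m4 - 3 * m2 * m2 * (W - 1)
        result["g2"] = numerator / ((W - 1) * (W - 2) * (W - 3) * s ** 4)
        result["se_g2"] = math.sqrt(
            4 * (W * W - 1) * result["se_g1"] ** 2 / ((W - 3) * (W + 5)))
    return result

stats = skewness_kurtosis(5, 50.0, 180.0, 1394.0)
```

The example input corresponds to the five values 1, 2, 3, 4, 10 with unit weights.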


Z-Scores 


If $X_i$ is missing or S < 0, $Z_i$ is set to the system missing value.


References

Bliss, C. I. 1967. Statistics in biology, Volume 1. New York: McGraw-Hill.

Spicer, C. C. 1972. Algorithm AS 52: Calculation of power sums of deviations about the mean.
Applied Statistics, 21, 226-227.


DETECTANOMALY Algorithms 


The Anomaly Detection procedure searches for unusual cases based on deviations from the
norms of their cluster groups. The procedure is designed to quickly detect unusual cases for
data-auditing purposes in the exploratory data analysis step, prior to any inferential data analysis. This
algorithm is designed for generic anomaly detection; that is, the definition of an anomalous case
is not specific to any particular application, such as detection of unusual payment patterns in the
healthcare industry or detection of money laundering in the finance industry, in which the
definition of an anomaly can be precisely specified.


Data Assumptions 


Data. This procedure works with both continuous and categorical variables. Each row represents
a distinct observation, and each column represents a distinct variable upon which the peer groups
are based. A case identification variable can be available in the data file for marking output, but it
will not be used in the analysis. Missing values are allowed. The weight variable, if specified,
is ignored.


The detection model can be applied to a new test data file. The elements of the test data must be the
same as the elements of the training data. Depending on the algorithm settings, the missing
value handling that is used to create the model may also be applied to the test data file prior to scoring.


Case order. Note that the solution may depend on the order of cases. To minimize order effects,
randomly order the cases. To verify the stability of a given solution, you may want to obtain several
different solutions with cases sorted in different random orders. In situations with extremely large
file sizes, multiple runs can be performed with a sample of cases sorted in different random orders.


Assumptions. The algorithm assumes that all variables are nonconstant and independent and 
that no case has missing values for any of the input variables. Each continuous variable is 
assumed to have a normal (Gaussian) distribution, and each categorical variable is assumed to 
have a multinomial distribution. Empirical internal testing indicates that the procedure is fairly 
robust to violations of both the assumption of independence and the distributional assumptions, 
but be aware of how well these assumptions are met. 


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


ID  The identity variable of each case in the data file.

n  The number of cases in the training data $X_{train}$.

$X_{ok}$, k = 1, ..., K  The set of input variables in the training data.

$M_k$, k ∈ {1, ..., K}  If $X_{ok}$ is a continuous variable, $M_k$ represents the grand mean, or average of
the variable across the entire training data.

$SD_k$, k ∈ {1, ..., K}  If $X_{ok}$ is a continuous variable, $SD_k$ represents the grand standard deviation,
or standard deviation of the variable across the entire training data.

$X_{K+1}$  A continuous variable created in the analysis. It represents the percentage of
variables (k = 1, ..., K) that have missing values in each case.

$X_k$, k = 1, ..., K  The set of processed input variables after the missing value handling is
applied. For more information, see the topic “Modeling Stage”.

H, or the boundaries of H: [$H_{min}$, $H_{max}$]  H is the pre-specified number of cluster groups to create.
Alternatively, the bounds [$H_{min}$, $H_{max}$] can be used to specify the minimum and maximum
numbers of cluster groups.

$n_h$, h = 1, ..., H  The number of cases in cluster h, h = 1, ..., H, based on the training data.

$p_h$, h = 1, ..., H  The proportion of cases in cluster h, h = 1, ..., H, based on the training
data. For each h, $p_h = n_h / n$.

$M_{hk}$, k = 1, ..., K+1, h = 1, ..., H  If $X_k$ is a continuous variable, $M_{hk}$ represents the cluster mean, or average
of the variable in cluster h based on the training data. If $X_k$ is a categorical
variable, it represents the cluster mode, or most popular categorical value of
the variable in cluster h based on the training data.

$SD_{hk}$, k ∈ {1, ..., K+1}, h = 1, ..., H  If $X_k$ is a continuous variable, $SD_{hk}$ represents the cluster standard deviation,
or standard deviation of the variable in cluster h based on the training data.

{$n_{hkj}$}, k ∈ {1, ..., K}, h = 1, ..., H, j = 1, ..., $J_k$  The frequency set {$n_{hkj}$} is defined only when $X_k$ is a categorical variable.
If $X_k$ has $J_k$ categories, then $n_{hkj}$ is the number of cases in cluster h that fall
into category j.

adjustment weight  An adjustment weight used to balance the influence between continuous and
categorical variables. It is a positive value with a default of 6.

$VDI_k$  The variable deviation index of a case is a measure of the deviation of
variable value $X_k$ from its cluster norm.

GDI  The group deviation index GDI of a case is the log-likelihood distance d(h, s),
which is the sum of all of the variable deviation indices {$VDI_k$, k = 1, ..., K+1}.

anomaly index  The anomaly index of a case is the ratio of the GDI to that of the average
GDI for the cluster group to which the case belongs.

variable contribution measure  The variable contribution measure of variable $X_k$ for a case is the ratio of
the $VDI_k$ to the case’s corresponding GDI.

$pct_{anomaly}$ or $n_{anomaly}$  A pre-specified value $pct_{anomaly}$ determines the percentage of cases to be
considered as anomalies. Alternatively, a pre-specified positive integer value
$n_{anomaly}$ determines the number of cases to be considered as anomalies.

$cutpoint_{anomaly}$  A pre-specified cutpoint; cases with anomaly index values greater than
$cutpoint_{anomaly}$ are considered anomalous.

$k_{anomaly}$  A pre-specified integer threshold $1 \le k_{anomaly} \le K+1$ determines the number of
variables considered as the reasons that the case is identified as an anomaly.


Algorithm Steps


This algorithm is divided into three stages: 


Modeling. Cases are placed into cluster groups based on their similarities on a set of input 
variables. The clustering model used to determine the cluster group of a case and the sufficient 
statistics used to calculate the norms of the cluster groups are stored. 


Scoring. The model is applied to each case to identify its cluster group and some indices are 
created for each case to measure the unusualness of the case with respect to its cluster group. 
All cases are sorted by the values of the anomaly indices. The top portion of the case list is 
identified as the set of anomalies. 


Reasoning. For each anomalous case, the variables are sorted by their corresponding variable
deviation indices. The top variables, their values, and the corresponding norm values are presented
as the reasons why a case is identified as an anomaly.


Modeling Stage

This stage performs the following tasks: 


1. Training Set Formation. Starting with the specified variables and cases, remove any case with 
extremely large values (greater than 1.0E+150) on any continuous variable. If missing value 
handling is not in effect, also remove cases with a missing value on any variable. Remove variables 
with all constant nonmissing values or all missing values. The remaining cases and variables are 
used to create the anomaly detection model. Statistics output to pivot table by the procedure are 
based on this training set, but variables saved to the dataset are computed for all cases. 


2. Missing Value Handling (Optional). For each input variable $X_{ok}$, k = 1, ..., K, if $X_{ok}$ is a
continuous variable, use all valid values of that variable to compute the grand mean $M_k$ and grand
standard deviation $SD_k$. Replace the missing values of the variable by its grand mean. If $X_{ok}$ is a
categorical variable, combine all missing values into a “missing value” category. This category is
treated as a valid category. Denote the processed form of {$X_{ok}$} by {$X_k$}.


3. Creation of Missing Value Pct Variable (Optional). A new continuous variable, $X_{K+1}$, is
created that represents the percentage of variables (both continuous and categorical) with missing
values in each case.


4. Cluster Group Identification. The processed input variables {$X_k$, k = 1, ..., K+1} are used to
create a clustering model. The two-step clustering algorithm is used with noise handling turned
on (see the TwoStep Cluster algorithm document for more information).


5. Sufficient Statistics Storage. The cluster model and the sufficient statistics for the variables
by cluster are stored for the Scoring stage:


• The grand mean $M_k$ and standard deviation $SD_k$ of each continuous variable are stored,
k ∈ {1, ..., K+1}.

• For each cluster h = 1, ..., H, store the size $n_h$. If $X_k$ is a continuous variable, store the cluster
mean $M_{hk}$ and standard deviation $SD_{hk}$ of the variable based on the cases in cluster h. If $X_k$ is
a categorical variable, store the frequency $n_{hkj}$ of each category j of the variable based on the
cases in cluster h. Also store the modal category $M_{hk}$. These sufficient statistics will be used
in calculating the log-likelihood distance d(h, s) between a cluster h and a given case s.
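
Steps 2 and 3 of the modeling stage can be sketched as below. This is an illustration under assumptions, not the procedure's code: cases are represented as dictionaries with None marking a missing value, the function name, the "<missing>" label and the "missing_pct" key are hypothetical, and the percentage is expressed on a 0 to 100 scale. Every continuous variable is assumed to have at least one valid value, which step 1 guarantees by removing variables that are entirely missing.

```python
MISSING_CATEGORY = "<missing>"  # hypothetical label for the pooled missing category

def handle_missing(cases, continuous_vars, categorical_vars):
    """Impute grand means, pool missing categories and add a missing-percentage variable."""
    all_vars = list(continuous_vars) + list(categorical_vars)
    grand_means = {}
    for var in continuous_vars:
        valid = [case[var] for case in cases if case[var] is not None]
        grand_means[var] = sum(valid) / len(valid)
    processed = []
    for case in cases:
        n_missing = sum(1 for var in all_vars if case[var] is None)
        row = {}
        for var in continuous_vars:
            row[var] = grand_means[var] if case[var] is None else case[var]
        for var in categorical_vars:
            row[var] = MISSING_CATEGORY if case[var] is None else case[var]
        row["missing_pct"] = 100.0 * n_missing / len(all_vars)
        processed.append(row)
    return processed, grand_means

cases = [
    {"age": 30.0, "region": "IL"},
    {"age": None, "region": None},
    {"age": 50.0, "region": "MA"},
]
processed, grand_means = handle_missing(cases, ["age"], ["region"])
```

The returned grand means would be stored with the model so that the same replacement values can be reused at scoring time if that behavior is wanted.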


Scoring Stage 
This stage performs the following tasks on scoring (testing or training) data: 


1. New Valid Category Screening. The scoring data should contain the input variables {$X_{ok}$, k = 1,
..., K} in the training data. Moreover, the format of the variables in the scoring data should be the
same as those in the training data file during the Modeling Stage.


Cases in the scoring data are screened out if they contain a categorical variable with a valid 
category that does not appear in the training data. For example, if Region is a categorical variable 
with categories IL, MA and CA in the training data, a case in the scoring data that has a valid 
category FL for Region will be excluded from the analysis. 
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2. 


Missing Value Handling (Optional). For each input variable Xo,, if Xo, is a continuous 
variable, use all valid values of that variable to compute the grand mean My and grand standard 
deviation SD,. Replace the missing values of the variable by its grand mean. If Xok is a 


categorical variable, combine all missing values and put together a missing value category. This 
category is treated 


as a valid category. 


Creation of Missing Value Pct Variable (Optional depending on Modeling Stage). If Xx+1 is 
created in the Modeling Stage, it is also computed for the scoring data. 


Assign Each Case to its Closest Non-Noise Cluster. The clustering model from the 
Modeling Stage is applied to the processed variables of the scoring data file to create a cluster 
ID for each case. Cases belonging to the noise cluster are reassigned to their closest non-noise 
cluster. See the TwoStep Cluster algorithm document for more information on the noise cluster. 


Calculate Variable Deviation Indices. Given a case s, the closest cluster h is found. The variable
deviation index $VDI_k$ of variable $X_k$ is defined as the contribution $d_k(h, s)$ of the variable to its
log-likelihood distance d(h, s). The corresponding norm value is $M_{hk}$, which is the cluster sample
mean of $X_k$ if $X_k$ is continuous, or the cluster mode of $X_k$ if $X_k$ is categorical.


Calculate Group Deviation Index. The group deviation index GDI of a case is the
log-likelihood distance d(h, s), which is the sum of all the variable deviation indices
$\{VDI_k, k = 1, \ldots, K+1\}$.


Calculate Anomaly Index and Variable Contribution Measures. Two additional indices are 
calculated that are easier to interpret than the group deviation index and the variable deviation 
index. 


The anomaly index of a case is an alternative to the GDI, which is computed as the ratio of the 
case’s GDI to the average GDI of the cluster to which the case belongs. Increasing values of this 
index correspond to greater deviations from the average and indicate better anomaly candidates. 


The variable contribution measure of a variable for a case is an alternative to the VDI. It is
computed as the ratio of the variable's VDI to the case's GDI. This is the proportional contribution
of the variable to the deviation of the case. The larger the value of this measure, the greater
the variable's contribution to the deviation.
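The relationship between the four indices can be illustrated with a short sketch. The function name and the sample numbers are hypothetical, and the sketch assumes that neither the case's GDI nor the cluster's average GDI is zero.

```python
def deviation_measures(vdi, cluster_avg_gdi):
    """Derive GDI, anomaly index and variable contributions for one case.

    vdi             -- list of variable deviation indices VDI_k for the case
    cluster_avg_gdi -- average GDI of the cluster the case belongs to
    Assumes both the case GDI and the cluster average GDI are nonzero.
    """
    gdi = sum(vdi)                          # GDI is the sum of all VDI_k
    anomaly_index = gdi / cluster_avg_gdi   # ratio to the cluster average
    contributions = [v / gdi for v in vdi]  # proportional contribution
    return gdi, anomaly_index, contributions


# Hypothetical case with two processed variables.
gdi, anomaly_index, contributions = deviation_measures([1.0, 3.0], 2.0)
```

In this example the second variable accounts for three quarters of the case's deviation, and the case deviates twice as much as the average case in its cluster.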


Odd Situations 


Zero Divided by Zero 


The situation in which the GDI of a case is zero and the average GDI of the cluster that the case 
belongs to is also zero is possible if the cluster is a singleton or is made up of identical cases and 
the case in question is the same as the identical cases. Whether this case is considered as an 
anomaly or not depends on whether the number of identical cases that make up the cluster is large 
or small. For example, suppose that there is a total of 10 cases in the training data and that two
clusters result, one of which is a singleton (that is, made up of one case) while the other has nine
cases. In this situation, the case in the singleton cluster should be considered an anomaly because it
does not belong to the larger cluster. One way to calculate the anomaly index in this situation is to
set it as the ratio of the average cluster size to the size of cluster h, which is:

$$\frac{n/H}{n_h}$$

Following the 10 cases example, the anomaly index for the case belonging to the singleton cluster
would be (10/2)/1 = 5, which should be large enough for the algorithm to catch it as an anomaly.
In this situation, the variable contribution measure is set to 1/(K+1), where (K+1) is the number of
processed variables in the analysis.


Nonzero Divided by Zero 


The situation in which the GDI of a case is nonzero but the average GDI of the cluster that the case
belongs to is 0 is possible if the corresponding cluster is a singleton or is made up of identical cases
and the case in question is not the same as the identical cases. Suppose that case i belongs to cluster
h, which has a zero average GDI; that is, $\text{average}(GDI)_h = 0$, but the GDI between case i and
cluster h is nonzero; that is, $GDI(i, h) \ne 0$. One choice for the anomaly index calculation of case i
could be to set the denominator as the weighted average GDI over all other clusters if this value is
not 0; else set the calculation as the ratio of average cluster size to the size of cluster h. That is,
the anomaly index of case i is

$$\begin{cases} \dfrac{GDI(i, h)}{\frac{1}{n - n_h} \sum_{t \ne h} n_t \, \text{average}(GDI)_t} & \text{if } \sum_{t \ne h} n_t \, \text{average}(GDI)_t \ne 0 \\[2ex] \dfrac{n/H}{n_h} & \text{otherwise} \end{cases}$$


This situation triggers a warning that the case is assigned to a cluster that is made up of identical 
cases. 
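The two odd situations can be folded into one anomaly index routine. The following is a minimal sketch under the rules described above; the function and argument names are hypothetical, and clusters are identified by list position.

```python
def anomaly_index(gdi, h, sizes, avg_gdi):
    """Anomaly index of a case with group deviation index `gdi` in cluster h.

    sizes   -- list of cluster sizes n_t
    avg_gdi -- list of average GDI values, one per cluster
    """
    n = sum(sizes)
    num_clusters = len(sizes)
    size_ratio = (n / num_clusters) / sizes[h]   # (n/H) / n_h

    if avg_gdi[h] != 0:
        return gdi / avg_gdi[h]                  # regular case
    if gdi == 0:
        return size_ratio                        # zero divided by zero
    # Nonzero divided by zero: weighted average GDI over all other clusters.
    weighted = sum(sizes[t] * avg_gdi[t]
                   for t in range(num_clusters) if t != h)
    if weighted != 0:
        return gdi / (weighted / (n - sizes[h]))
    return size_ratio


# The 10-case example: a singleton cluster and a nine-case cluster.
index_for_singleton = anomaly_index(0.0, 0, [1, 9], [0.0, 2.0])
```

For the singleton case in the 10-case example the routine returns (10/2)/1 = 5.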


Reasoning Stage 


Every case now has a group deviation index and anomaly index and a set of variable deviation 
indices and variable contribution measures. The purpose of this stage is to rank the likely 
anomalous cases and provide the reasons to suspect them of being anomalous. 


1. Identify the Most Anomalous Cases. Sort the cases in descending order on the values of the
anomaly index. The top $pct_{anomaly}$ % (or alternatively, the top $N_{anomaly}$) gives the anomaly list,
subject to the restriction that cases with an anomaly index less than or equal to $cutpoint_{anomaly}$ are not
considered anomalous.


2. Provide Reasons for Considering a Case Anomalous. For each anomalous case, sort the
variables by their corresponding $VDI_k$ values in descending order. The top $k_{anomaly}$ variable
names, their values (of the corresponding original variable $X_{ok}$), and the norm values are displayed
as reasoning.
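The ranking logic of this stage can be sketched as follows. The function names, the default parameter values, and the choice to round the percentage up to a whole number of cases are all assumptions made for illustration; they are not taken from the procedure.

```python
import math


def anomaly_list(anomaly_index, pct=5.0, cutpoint=2.0):
    """Return indices of the most anomalous cases, most anomalous first.

    pct and cutpoint stand in for the percentage and cutpoint settings;
    their default values here are hypothetical.
    """
    order = sorted(range(len(anomaly_index)),
                   key=lambda i: anomaly_index[i], reverse=True)
    top_n = math.ceil(len(order) * pct / 100.0)  # rounding rule is assumed
    # Cases at or below the cutpoint are never considered anomalous.
    return [i for i in order[:top_n] if anomaly_index[i] > cutpoint]


def top_reasons(vdi_by_name, k=2):
    """Names of the k variables with the largest VDI for one case."""
    return sorted(vdi_by_name, key=vdi_by_name.get, reverse=True)[:k]


cases = anomaly_list([1.0, 6.0, 0.5, 3.0, 2.5], pct=60.0, cutpoint=2.0)
reasons = top_reasons({"age": 0.4, "income": 2.1, "region": 1.3}, k=2)
```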


Key Formulas from Two-Step Clustering 


The two-step clustering algorithm consists of: (a) a pre-cluster step that pre-clusters cases into 
many sub-clusters and (b) a cluster step that clusters the sub-clusters resulting from pre-cluster 
step into the desired number of clusters. It can also select the number of clusters automatically. 


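The formula for the log-likelihood distance d(j, s) between two clusters j and s is as follows:

$$d(j, s) = \xi_j + \xi_s - \xi_{<j,s>}$$

where

$$\xi_v = -N_v \left( \sum_{k=1}^{K^A} \frac{1}{2} \log\left(\Delta_k + \hat\sigma_{vk}^2\right) + \sum_{k=1}^{K^B} \hat E_{vk} \right)$$

and

$$\hat E_{vk} = -\sum_{l=1}^{L_k} \frac{N_{vkl}}{N_v} \log\frac{N_{vkl}}{N_v}$$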


in which $\Delta_k > 0$ is a positive adjustment included in the formula to avoid the logarithm of zero in
the calculation. Its value is set as:


where m is user-specified and set to m = 6 by default, and $\hat\sigma_k^2$ is the sample variance of variable
$X_k$ over the entire training sample.


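The log-likelihood distance can be computed as follows:

$$d(j, s) = \sum_{k=1}^{K^A + K^B} d_k(j, s)$$

where

$$d_k(j, s) = \begin{cases} \left\{ -N_j \log(\Delta_k + \hat\sigma_{jk}^2) - N_s \log(\Delta_k + \hat\sigma_{sk}^2) + N_{<j,s>} \log(\Delta_k + \hat\sigma_{<j,s>k}^2) \right\} / 2 \\[1ex] -N_j \hat E_{jk} - N_s \hat E_{sk} + N_{<j,s>} \hat E_{<j,s>k} \end{cases}$$

depending on whether the corresponding variable $X_k$ is continuous or categorical.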
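The per-variable contributions can be sketched directly from raw values. This is an illustration only: the function names are hypothetical, the adjustment `delta` is passed in rather than derived, and the variance is computed with divisor N, which is an assumption about the estimator.

```python
import math
from collections import Counter


def _variance(values):
    # Variance with divisor N (assumed estimator for this sketch).
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def _entropy(categories):
    n = len(categories)
    return -sum(c / n * math.log(c / n) for c in Counter(categories).values())


def d_continuous(xj, xs, delta):
    """Contribution of one continuous variable to d(j, s)."""
    merged = xj + xs
    return (-len(xj) * math.log(delta + _variance(xj))
            - len(xs) * math.log(delta + _variance(xs))
            + len(merged) * math.log(delta + _variance(merged))) / 2


def d_categorical(cj, cs):
    """Contribution of one categorical variable to d(j, s)."""
    merged = cj + cs
    return (-len(cj) * _entropy(cj) - len(cs) * _entropy(cs)
            + len(merged) * _entropy(merged))


far_apart = d_continuous([0.0, 2.0], [10.0, 12.0], delta=1.0)
disjoint = d_categorical(["a", "a"], ["b", "b"])
```

Merging two clusters that are identical on a variable contributes zero, while well-separated clusters contribute a positive amount.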


For more information, see the topic “TWOSTEP CLUSTER Algorithms”.


No analysis is done for any subfile group for which the number of non-empty groups is less
than two or the number of cases or sum of weights fails to exceed the number of non-empty
groups. An analysis may be stopped if no variables are selected during variable selection or
the eigenanalysis fails.


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


Table 37-1
Notation

Notation     Description
g            Number of groups
p            Number of variables
q            Number of variables selected
$X_{ijk}$    Value of variable i for case k in group j
$f_{jk}$     Case weight for case k in group j
$m_j$        Number of cases in group j
$n_j$        Sum of case weights in group j
n            Total sum of weights


Basic Statistics 


The procedure calculates the following basic statistics. 


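Mean

$$\bar X_{ij} = \frac{\sum_{k=1}^{m_j} f_{jk} X_{ijk}}{n_j} \quad \text{(variable i in group j)}$$

$$\bar X_{i\cdot} = \frac{\sum_{j=1}^{g} \sum_{k=1}^{m_j} f_{jk} X_{ijk}}{n} \quad \text{(variable i)}$$
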
Variances

$$S_{ij}^2 = \frac{\sum_{k=1}^{m_j} f_{jk} X_{ijk}^2 - n_j \bar X_{ij}^2}{n_j - 1} \quad \text{(variable i in group j)}$$


Within-Groups Sums of Squares and Cross-Product Matrix (W)

$$w_{il} = \sum_{j=1}^{g} \sum_{k=1}^{m_j} f_{jk} X_{ijk} X_{ljk} - \sum_{j=1}^{g} \left( \sum_{k=1}^{m_j} f_{jk} X_{ijk} \right) \left( \sum_{k=1}^{m_j} f_{jk} X_{ljk} \right) \Big/ n_j \qquad i, l = 1, \ldots, p$$


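Total Sums of Squares and Cross-Product Matrix (T)

$$t_{il} = \sum_{j=1}^{g} \sum_{k=1}^{m_j} f_{jk} X_{ijk} X_{ljk} - \left( \sum_{j=1}^{g} \sum_{k=1}^{m_j} f_{jk} X_{ijk} \right) \left( \sum_{j=1}^{g} \sum_{k=1}^{m_j} f_{jk} X_{ljk} \right) \Big/ n$$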


Within-Groups Covariance Matrix

$$C = \frac{W}{n - g} \qquad n > g$$


Individual Group Covariance Matrices

$$c_{il}^{(j)} = \frac{\sum_{k=1}^{m_j} f_{jk} X_{ijk} X_{ljk} - \bar X_{ij} \bar X_{lj} n_j}{n_j - 1}$$


Within-Groups Correlation Matrix (R)

$$r_{il} = \begin{cases} \dfrac{w_{il}}{\sqrt{w_{ii} w_{ll}}} & \text{if } w_{ii} w_{ll} > 0 \\ \text{SYSMIS} & \text{otherwise} \end{cases}$$


Total Covariance Matrix 


Univariate F and Λ for Variable i

$$F_i = \frac{(t_{ii} - w_{ii})(n - g)}{w_{ii}(g - 1)}$$

with g - 1 and n - g degrees of freedom

$$\Lambda_i = \frac{w_{ii}}{t_{ii}}$$

with 1, g - 1 and n - g degrees of freedom
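As a worked illustration, the univariate F and Wilks' lambda for a single variable can be computed from the diagonal entries of W and T. The sketch below assumes unit case weights (so n is the number of cases); the function name and sample data are hypothetical.

```python
def univariate_f_and_lambda(values, groups):
    """Univariate F and Wilks' lambda for one variable, unit case weights."""
    n = len(values)
    labels = sorted(set(groups))
    g = len(labels)

    grand_mean = sum(values) / n
    t_ii = sum((x - grand_mean) ** 2 for x in values)    # total SS

    w_ii = 0.0                                           # within-groups SS
    for label in labels:
        xs = [x for x, grp in zip(values, groups) if grp == label]
        mean = sum(xs) / len(xs)
        w_ii += sum((x - mean) ** 2 for x in xs)

    f_stat = (t_ii - w_ii) * (n - g) / (w_ii * (g - 1))
    return f_stat, w_ii / t_ii


f_stat, wilks = univariate_f_and_lambda(
    [1.0, 2.0, 3.0, 5.0, 6.0, 7.0], [0, 0, 0, 1, 1, 1])
```

Here the total sum of squares is 28 and the within-groups sum of squares is 4, so F = 24 × 4 / (4 × 1) = 24 and Λ = 4/28.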


Rules of Variable Selection 


Both direct and stepwise variable entry are possible. Multiple inclusion levels may also be 
specified.


Method = Direct


For direct variable selection, variables are considered for inclusion in the order in which they are
written on the ANALYSIS = list. A variable is included in the analysis if, when it is included,
no variable in the analysis will have a tolerance less than the specified tolerance limit (default
= 0.001).


Stepwise Variable Selection 


At each step, the following rules control variable selection: 


Eligible variables with higher inclusion levels are entered before eligible variables with lower 
inclusion levels. 


The order of entry of eligible variables with the same even inclusion level is determined by 
their order on the ANALYSIS = specification. 


The order of entry of eligible variables with the same odd level of inclusion is determined 
by their value on the entry criterion. The variable with the “best” value for the criterion 
statistic is entered first. 


When level-one processing is reached, prior to inclusion of any eligible variables, all 
already-entered variables which have level one inclusion numbers are examined for removal. 
A variable is considered eligible for removal if its F-to-remove is less than the F value for 
variable removal, or, if probability criteria are used, the significance of its F-to-remove 
exceeds the specified probability level. If more than one variable is eligible for removal, that 
variable is removed that leaves the “best” value for the criterion statistic for the remaining 
variables. Variable removal continues until no more variables are eligible for removal. 
Sequential entry of variables then proceeds as described previously, except that after each step, 
variables with inclusion numbers of one are also considered for exclusion as described before. 


A variable with a zero inclusion level is never entered, although some statistics for it are 
printed. 


Ineligibility for Inclusion 


A variable with an odd inclusion number is considered ineligible for inclusion if: 


The tolerance of any variable in the analysis (including its own) drops below the specified 
tolerance limit if it is entered, or 


Its F-to-enter is less than the F value for variable entry, or


If probability criteria are used, the significance level associated with its F-to-enter exceeds the 
probability to enter. 


A variable with an even inclusion number is ineligible for inclusion if the first condition above 
is met.


Computations During Variable Selection


During variable selection, the matrix W is replaced at each step by a new matrix W* using the
symmetric sweep operator described by Dempster (1969). If the first q variables have been
included in the analysis, W may be partitioned as:

$$W = \begin{pmatrix} W_{11} & W_{12} \\ W_{21} & W_{22} \end{pmatrix}$$

where $W_{11}$ is q×q. At this stage, the matrix W* is defined by

$$W^* = \begin{pmatrix} -W_{11}^{-1} & W_{11}^{-1} W_{12} \\ W_{21} W_{11}^{-1} & W_{22} - W_{21} W_{11}^{-1} W_{12} \end{pmatrix} = \begin{pmatrix} W^*_{11} & W^*_{12} \\ W^*_{21} & W^*_{22} \end{pmatrix}$$

In addition, when stepwise variable selection is used, T is replaced by the matrix T*, defined
similarly.
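One way to obtain a matrix of this partitioned form is to sweep one pivot at a time. The sketch below is an illustrative pure-Python implementation using the sign convention of the partitioned formula above; the function name is hypothetical, and no pivoting or tolerance checks are performed.

```python
def sweep(matrix, pivots):
    """Sweep a symmetric matrix (list of lists) on the given pivot indices.

    After sweeping the first q indices, the result has the block form
    [[-W11^-1, W11^-1 W12], [W21 W11^-1, W22 - W21 W11^-1 W12]].
    """
    a = [row[:] for row in matrix]
    size = len(a)
    for k in pivots:
        d = a[k][k]
        # Update entries outside the pivot row and column.
        new = [[a[i][j] - a[i][k] * a[k][j] / d for j in range(size)]
               for i in range(size)]
        # Pivot row and column are scaled by the pivot element.
        for i in range(size):
            new[i][k] = a[i][k] / d
            new[k][i] = a[k][i] / d
        new[k][k] = -1.0 / d
        a = new
    return a


W = [[4.0, 2.0], [2.0, 3.0]]
W_star = sweep(W, [0])                  # first variable in the analysis
tolerance_var2 = W_star[1][1] / W[1][1]  # variable 2 is not in the analysis
```

Sweeping on every index yields the negated inverse of the original matrix, which is a convenient check on an implementation.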


The following statistics are computed. 


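Tolerance

$$TOL_i = \begin{cases} 0 & \text{if } w_{ii} = 0 \\ w^*_{ii} / w_{ii} & \text{if variable i is not in the analysis and } w_{ii} \ne 0 \\ -1 / (w^*_{ii} w_{ii}) & \text{if variable i is in the analysis and } w_{ii} \ne 0 \end{cases}$$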


If a variable’s tolerance is less than or equal to the specified tolerance limit, or its inclusion in the 
analysis would reduce the tolerance of another variable in the equation to or below the limit, the 
following statistics are not computed for it or any set including it. 


F-to-Remove

$$F_i = \frac{(w^*_{ii} - t^*_{ii})(n - q - g + 1)}{t^*_{ii}(g - 1)}$$

with degrees of freedom g - 1 and n - q - g + 1.


F-to-Enter

$$F_i = \frac{(t^*_{ii} - w^*_{ii})(n - q - g)}{w^*_{ii}(g - 1)}$$

with degrees of freedom g - 1 and n - q - g.


Wilks’ Lambda for Testing the Equality of Group Means

$$\Lambda = |W_{11}| / |T_{11}|$$

with degrees of freedom q, g - 1 and n - g.


The Approximate F Test for Lambda (the “overall F”), also known as Rao’s
R (Tatsuoka, 1971)

$$F = \frac{(1 - \Lambda^s)(r/s + 1 - qh/2)}{\Lambda^s qh}$$

where

$$s = \begin{cases} \sqrt{\dfrac{q^2 + h^2 - 5}{q^2 h^2 - 4}} & \text{if } q^2 + h^2 \ne 5 \\ 1 & \text{otherwise} \end{cases}$$

$$r = n - 1 - (q + g)/2$$

$$h = g - 1$$

with degrees of freedom qh and r/s + 1 - qh/2. The approximation is exact if q or h is 1 or 2.
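The approximate F test can be sketched as a small function, assuming s is defined as the square root of (q² + h² − 5)/(q²h² − 4), with s = 1 when q² + h² = 5. The function name is hypothetical, and the p-value lookup from the F distribution is left out.

```python
import math


def rao_approximate_f(wilks_lambda, n, q, g):
    """Approximate F for Wilks' lambda; returns (F, df1, df2)."""
    h = g - 1
    if q * q + h * h != 5:
        s = math.sqrt((q * q + h * h - 5) / (q * q * h * h - 4))
    else:
        s = 1.0   # here q^2 h^2 - 4 = 0, so s is defined separately
    r = n - 1 - (q + g) / 2
    df1 = q * h
    df2 = r / s + 1 - q * h / 2
    lam_s = wilks_lambda ** s
    f_stat = (1 - lam_s) * df2 / (lam_s * df1)
    return f_stat, df1, df2


# One variable, two groups, six cases, lambda = 1/7.
f_stat, df1, df2 = rao_approximate_f(1.0 / 7.0, n=6, q=1, g=2)
```

With q = 1 and g = 2 the statistic reduces to (1 − Λ)(n − 2)/Λ, which for Λ = 1/7 and n = 6 gives 24 with 1 and 4 degrees of freedom.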


Rao’s V (Lawley-Hotelling Trace) (Rao, 1952; Morrison, 1976)

$$V = -(n - g) \sum_{i=1}^{q} \sum_{l=1}^{q} w^*_{il} (t_{il} - w_{il})$$

When n - g is large, V, under the null hypothesis, is approximately distributed as $\chi^2$ with q(g - 1)
degrees of freedom. When an additional variable is entered, the change in V, if positive, has
approximately a $\chi^2$ distribution with g - 1 degrees of freedom.


The Squared Mahalanobis Distance (Morrison, 1976) between groups a and b

$$D_{ab}^2 = -(n - g) \sum_{i=1}^{q} \sum_{l=1}^{q} w^*_{il} (\bar X_{ia} - \bar X_{ib})(\bar X_{la} - \bar X_{lb})$$


The F Value for Testing the Equality of Means of Groups a and b

$$F_{ab} = \frac{(n - q - g + 1) n_a n_b}{q (n - g)(n_a + n_b)} D_{ab}^2$$


The Sum of Unexplained Variations (Dixon, 1973)

$$R = \sum_{a=1}^{g-1} \sum_{b=a+1}^{g} \frac{4}{4 + D_{ab}^2}$$


Classification Functions 


Once a set of q variables has been selected, the classification functions (also known as Fisher’s
linear discriminant functions) can be computed using

$$b_{ij} = (n - g) \sum_{l=1}^{q} w^*_{il} \bar X_{lj} \qquad i = 1, 2, \ldots, q; \; j = 1, 2, \ldots, g$$

for the coefficients, and

$$a_j = \log p_j - \frac{1}{2} \sum_{i=1}^{q} b_{ij} \bar X_{ij} \qquad j = 1, 2, \ldots, g$$

for the constant, where $p_j$ is the prior probability of group j.


Canonical Discriminant Functions 


The canonical discriminant function coefficients are determined by solving the general eigenvalue
problem

$$(T - W)V = \lambda W V$$

where V is the unscaled matrix of discriminant function coefficients and λ is a diagonal matrix of
eigenvalues. The eigensystem is solved as follows:

The Cholesky decomposition

$$W = LU$$

is formed, where L is a lower triangular matrix, and $U = L'$.

The symmetric matrix $L^{-1}(T - W)U^{-1}$ is formed and the system

$$(L^{-1}(T - W)U^{-1} - \lambda I)(UV) = 0$$

is solved using tridiagonalization and the QL method. The result is m eigenvalues, where
m = min(q, g - 1), and corresponding orthonormal eigenvectors, UV. The eigenvectors of the
original system are obtained as

$$V = U^{-1}(UV)$$


For each of the eigenvalues, which are ordered in descending magnitude, the following statistics 
are calculated. 


Percentage of Between-Groups Variance Accounted for

$$\frac{100 \lambda_k}{\sum_{k=1}^{m} \lambda_k}$$


Canonical Correlation

$$\sqrt{\lambda_k / (1 + \lambda_k)}$$


Wilks’ Lambda


Testing the significance of all the discriminating functions after the first k:

$$\Lambda_k = \prod_{i=k+1}^{m} \frac{1}{1 + \lambda_i} \qquad k = 0, 1, \ldots, m - 1$$

The significance level is based on

$$\chi^2 = -(n - (q + g)/2 - 1) \ln \Lambda_k$$

which is distributed as a $\chi^2$ with (q - k)(g - k - 1) degrees of freedom.
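The sequence of residual tests can be sketched from a list of eigenvalues. The function name and the sample eigenvalues below are hypothetical; the sketch stops at the chi-square statistic and its degrees of freedom and does not compute the significance level.

```python
import math


def residual_wilks_tests(eigenvalues, n, q, g):
    """Return (k, Wilks' lambda, chi-square, df) for k = 0, ..., m - 1."""
    lam = sorted(eigenvalues, reverse=True)   # descending magnitude
    results = []
    for k in range(len(lam)):
        wilks = 1.0
        for value in lam[k:]:                 # functions after the first k
            wilks *= 1.0 / (1.0 + value)
        chi_square = -(n - (q + g) / 2 - 1) * math.log(wilks)
        df = (q - k) * (g - k - 1)
        results.append((k, wilks, chi_square, df))
    return results


tests = residual_wilks_tests([3.0, 1.0], n=20, q=2, g=3)
```

For eigenvalues 3 and 1, Λ₀ = (1/4)(1/2) = 0.125 with 4 degrees of freedom, and Λ₁ = 0.5 with 1 degree of freedom.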


The Standardized Canonical Discriminant Coefficient Matrix D

The standardized canonical discriminant coefficient matrix D is computed as

$$D = S_{11} V$$

where

$S = \text{diag}(\sqrt{w_{11}}, \sqrt{w_{22}}, \ldots, \sqrt{w_{pp}})$

$S_{11}$ = partition containing the first q rows and columns of S

V is a matrix of eigenvectors such that $V' W_{11} V = I$


The Correlations Between the Canonical Discriminant Functions and the
Discriminating Variables


The correlations between the canonical discriminant functions and the discriminating variables
are given by

$$R = S_{11}^{-1} W_{11} V$$

If some variables were not selected for inclusion in the analysis (q < p), the eigenvectors are
implicitly extended with zeroes to include the nonselected variables in the correlation matrix.


Variables for which $w_{ii} = 0$ are excluded from S and W for this calculation; p then
represents the number of variables with non-zero within-groups variance.


The Unstandardized Coefficients

The unstandardized coefficients are calculated from the standardized ones using

$$B = \sqrt{n - g} \; S_{11}^{-1} D$$

The associated constants are:

$$a_k = -\sum_{i=1}^{q} b_{ik} \bar X_{i\cdot}$$

The group centroids are the canonical discriminant functions evaluated at the group means:

$$\bar f_{kj} = a_k + \sum_{i=1}^{q} b_{ik} \bar X_{ij}$$


Tests For Equality Of Variance


Box’s M is used to test for equality of the group covariance matrices.

$$M = (n - g) \log|C'| - \sum_{j=1}^{g} (n_j - 1) \log|C^{(j)}|$$

where

C′ = pooled within-groups covariance matrix excluding groups with singular covariance matrices
$C^{(j)}$ = covariance matrix for group j.

Determinants of C′ and $C^{(j)}$ are obtained from the Cholesky decomposition. If any diagonal
element of the decomposition is less than $10^{-11}$, the matrix is considered singular and excluded
from the analysis.

$$\log|C^{(j)}| = 2 \sum_{i=1}^{p} \log l_{ii} - p \log(n_j - 1)$$

where $l_{ii}$ is the ith diagonal entry of L such that $(n_j - 1) C^{(j)} = L'L$. Similarly,

$$\log|C'| = 2 \sum_{i=1}^{p} \log l_{ii} - p \log(n' - g)$$

where

$(n' - g) C' = L'L$
n′ = sum of weights of cases in all groups with nonsingular covariance matrices


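The significance level is obtained from the F distribution with $t_1$ and $t_2$ degrees of freedom
using (Cooley and Lohnes, 1971):

$$F = \begin{cases} M / b & \text{if } e_2 > e_1^2 \\ \dfrac{t_2 M}{t_1 (b - M)} & \text{if } e_2 < e_1^2 \end{cases}$$

where

$$e_1 = \left( \sum_{j=1}^{g} \frac{1}{n_j - 1} - \frac{1}{n - g} \right) \frac{2p^2 + 3p - 1}{6(g - 1)(p + 1)}$$

$$e_2 = \left( \sum_{j=1}^{g} \frac{1}{(n_j - 1)^2} - \frac{1}{(n - g)^2} \right) \frac{(p - 1)(p + 2)}{6(g - 1)}$$

$$t_1 = (g - 1) p (p + 1) / 2$$

$$t_2 = (t_1 + 2) / |e_2 - e_1^2|$$

$$b = \begin{cases} \dfrac{t_1}{1 - e_1 - t_1/t_2} & \text{if } e_2 > e_1^2 \\ \dfrac{t_2}{1 - e_1 + 2/t_2} & \text{if } e_2 < e_1^2 \end{cases}$$

If $e_1^2 - e_2$ is zero, or much smaller than $e_2$, $t_2$ cannot be computed or cannot be computed
accurately. If

$$e_2 = e_2 + 0.0001 \, (e_2 - e_1^2)$$

the program uses Bartlett’s $\chi^2$ statistic rather than the F statistic:

$$\chi^2 = M (1 - e_1)$$

with $t_1$ degrees of freedom.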


For testing the group covariance matrix of the canonical discriminant functions, the procedure is
similar. The covariance matrices $C^{(j)}$ and C′ are replaced by $D_j$ and D′, where

$$D_j = B' C^{(j)} B$$

is the group covariance matrix of the discriminant functions.


The pooled covariance matrix in this case is an identity, so that

$$D' = (n - g) I_m - \sum_{j} (n_j - 1) D_j$$

where the summation is only over groups with singular $D_j$.


Classification 


The basic procedure for classifying a case is as follows:

■ If X is the 1×q vector of discriminating variables for the case, the 1×m vector of canonical
discriminant function values is

$$f = XB + a$$

■ A chi-square distance from each centroid is computed

$$\chi_j^2 = (f - \bar f_j) D_j^{-1} (f - \bar f_j)'$$

where $D_j$ is the covariance matrix of canonical discriminant functions for group j and $\bar f_j$ is
the group centroid vector. If the case is a member of group j, $\chi_j^2$ has a $\chi^2$ distribution with
m degrees of freedom. P(X|G), labeled as P(D>d|G=g) in the output, is the significance
level of such a $\chi_j^2$.

■ The classification, or posterior probability, is

$$P(G_j|X) = \frac{p_j |D_j|^{-1/2} e^{-\chi_j^2/2}}{\sum_{j'=1}^{g} p_{j'} |D_{j'}|^{-1/2} e^{-\chi_{j'}^2/2}}$$

where $p_j$ is the prior probability for group j. A case is classified into the group for which
$P(G_j|X)$ is highest.


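The actual calculation of $P(G_j|X)$ is

$$P(G_j|X) = \begin{cases} \dfrac{\exp(g_j - \max_{j'} g_{j'})}{\sum_{j'=1}^{g} \exp(g_{j'} - \max_{j'} g_{j'})} & \text{if } g_j - \max_{j'} g_{j'} > -46 \\ 0 & \text{otherwise} \end{cases}$$

where

$$g_j = \log p_j - \frac{1}{2} \left( \log|D_j| + \chi_j^2 \right)$$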


If individual group covariances are not used in classification, the pooled within-groups covariance 
matrix of the discriminant functions (an identity matrix) is substituted for D; in the above 
calculation, resulting in considerable simplification. 
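Subtracting the largest log-score before exponentiating keeps the calculation from underflowing. The sketch below illustrates the idea; the function name is hypothetical, and as a simplification it drops terms below the -46 floor from the denominator as well, which changes the result only negligibly.

```python
import math


def posterior_probabilities(priors, log_dets, chi_squares, floor=-46.0):
    """Posterior probability of each group for one case.

    priors      -- prior probabilities p_j
    log_dets    -- log|D_j| for each group
    chi_squares -- chi-square distance of the case from each centroid
    """
    scores = [math.log(p) - 0.5 * (ld + c)
              for p, ld, c in zip(priors, log_dets, chi_squares)]
    top = max(scores)
    # Groups whose score is too far below the maximum get probability 0.
    terms = [math.exp(s - top) if s - top > floor else 0.0 for s in scores]
    total = sum(terms)
    return [t / total for t in terms]


probs = posterior_probabilities([0.5, 0.5], [0.0, 0.0],
                                [0.0, 2.0 * math.log(3.0)])
```

With equal priors and unit determinants, a chi-square distance of 2 ln 3 from the second centroid makes the second group one third as likely as the first, giving posteriors of 0.75 and 0.25.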


If any $D_j$ is singular, a pseudo-inverse of the form

$$\begin{pmatrix} D_{j11}^{-1} & 0 \\ 0 & 0 \end{pmatrix}$$

replaces $D_j^{-1}$ and $|D_{j11}|$ replaces $|D_j|$. $D_{j11}$ is a submatrix of $D_j$ whose rows and columns
correspond to functions not dependent on preceding functions. That is, function 1 will be excluded
only if the rank of $D_j$ = 0, function 2 will be excluded only if it is dependent on function 1, and
so on. This choice of the pseudo-inverse is not optimal for the numerical stability of $D_{j11}^{-1}$, but
maximizes the discrimination power of the remaining functions.


Cross-Validation 


The following notation is used in this section: 


Table 37-2
Notation

Notation            Description
$X_{jk}$            $(X_{1jk}, \ldots, X_{qjk})^T$
$M_j$               Sample mean of jth group
                    $M_j = \frac{1}{n_j} \sum_{k=1}^{m_j} f_{jk} X_{jk}$
$M_{jk}$            Sample mean of jth group excluding point $X_{jk}$
                    $M_{jk} = \frac{1}{n_j - f_{jk}} \sum_{l=1, l \ne k}^{m_j} f_{jl} X_{jl}$
$\Sigma$            Pooled sample covariance matrix
$\Sigma_j$          Sample covariance matrix of jth group
$\Sigma_{jk}$       Pooled sample covariance matrix without point $X_{jk}$
$\Sigma_{jk}^{-1}$  $\frac{n - g - f_{jk}}{n - g} \left( \Sigma^{-1} + \dfrac{n_j f_{jk} \Sigma^{-1} (X_{jk} - M_j)(X_{jk} - M_j)^T \Sigma^{-1}}{(n_j - f_{jk})(n - g) - n_j f_{jk} (X_{jk} - M_j)^T \Sigma^{-1} (X_{jk} - M_j)} \right)$
$d_0^2(a, b)$       $(a - b)^T \Sigma_{jk}^{-1} (a - b)$

Cross-validation applies only to linear discriminant analysis (not quadratic). During
cross-validation, all cases in the dataset are looped over. Each case, say $X_{jk}$, is extracted once and
treated as test data. The remaining cases are treated as a new dataset.


Here we compute $d_0^2(X_{jk}, M_{jk})$ and $d_0^2(X_{jk}, M_i)$ (i = 1, ..., g, i ≠ j). If there is an i (i ≠ j) that
satisfies $\log(P_i) - d_0^2(X_{jk}, M_i)/2 > \log(P_j) - d_0^2(X_{jk}, M_{jk})/2$, then the extracted point
$X_{jk}$ is misclassified. The estimate of the prediction error rate is the ratio of the sum of misclassified
case weights to the sum of all case weights.

To reduce computation time, the linear discriminant method is used instead of the canonical
discriminant method. The theoretical solution is exactly the same for both methods.


Rotations 


Varimax rotations may be performed on either the matrix of canonical discriminant function
coefficients or on that of the correlations between the canonical discriminant functions and the
discriminating variables (the structure matrix). The actual algorithm for the rotation is described in
FACTOR. For the Kaiser normalization

$$h_i^2 = \begin{cases} 1 + \dfrac{1}{w_{ii} w^*_{ii}} \ \text{(squared multiple correlation)} & \text{if coefficients rotated} \\ \sum_{k=1}^{m} r_{ik}^2 & \text{if correlations rotated} \end{cases}$$

The unrotated structure matrix is

$$R = S_{11}^{-1} W_{11} V$$

If the rotation transformation matrix is represented by K, the rotated standardized coefficient
matrix $D_R$ is given by

$$D_R = DK$$

The rotated matrix of pooled within-groups correlations between the canonical discriminant
functions and the discriminating variables $R_R$ is

$$R_R = RK$$

The eigenvector matrix V satisfies

$$V'(T - W)V = \Lambda = \text{diag}(\lambda_1, \lambda_2, \ldots, \lambda_m)$$

where the $\lambda_k$ are the eigenvalues. The equivalent matrix for the rotated coefficients $V_R$,

$$(V_R)'(T - W)V_R$$

is not diagonal, meaning the rotated functions, unlike the unrotated ones, are correlated for the
original sample, although their within-groups covariance matrix is an identity. The diagonals of
the above matrix may still be interpreted as the between-groups variances of the functions. They
are the numerators for the proportions of variance printed with the transformation matrix. The
denominator is their sum. After rotation, the columns of the transformation are exchanged, if
necessary, so that the diagonals of the matrix above are in descending order.


Ensembles Algorithms


Ensembles are used to enhance model accuracy (boosting), enhance model stability (bagging), 
and build models for very large datasets (pass, stream, merge). 


■ For more information, see the topic “Very large datasets (pass, stream, merge) algorithms”.


■ For more information, see the topic “Bagging and Boosting Algorithms”.


Bagging and Boosting Algorithms


Bootstrap aggregating (Bagging) and boosting are algorithms used to improve model stability and
accuracy. Bagging works well for unstable base models and can reduce variance in predictions.
Boosting can be used with any type of model and can reduce variance and bias in predictions.


Notation 


The following notation is used for bagging and boosting unless otherwise stated:


K                         The number of distinct records in the training set.
$x_k$                     Predictor values for the kth record.
$y_k$                     Target value for the kth record.
$f_k$                     Frequency weight for the kth record.
$w_k$                     Analysis weight for the kth record.
N                         The total number of records; $N = \sum_{k=1}^{K} f_k$.
M                         The number of base models to build; for bagging, this is the number of
                          bootstrap samples.
$T^m(\cdot)$              The model built on the mth bootstrap sample.
$f_k^m$                   Simulated frequency weight for the kth record of the mth bootstrap sample.
$w_k^m$                   Updated analysis weight for the kth record of the mth bootstrap sample.
$\hat y_k^m = T^m(x_k)$   Predicted target value of the kth record by the mth model.
$p_i^m(x_k)$              For a categorical target, the probability that the kth record belongs to
                          category i, i = 1, ..., C, in model m.
$I(\pi)$                  For any condition π, I(π) is 1 if π holds and 0 otherwise.


Bootstrap Aggregation


Bootstrap aggregation (bagging) produces replicates of the training dataset by sampling with
replacement from the original dataset. This creates bootstrap samples of equal size to the original
dataset. The algorithm is performed iteratively over k = 1, ..., K and m = 1, ..., M to generate frequency
weights:

$$f_k^m = \begin{cases} \text{rv.binom}\left( N, \dfrac{f_k}{N} \right) & k = 1 \\[2ex] \text{rv.binom}\left( N - \sum_{i=1}^{k-1} f_i^m, \; \dfrac{f_k}{N - \sum_{i=1}^{k-1} f_i} \right) & \text{otherwise} \end{cases}$$

Then a model is built on each replicate. Together these models form an ensemble model. The
ensemble model scores new records using one of the following methods; the available methods
depend upon the measurement level of the target.
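Drawing the simulated frequency weights one record at a time from conditional binomial distributions is equivalent to drawing a bootstrap sample of size N. The sketch below is illustrative only: the function names are hypothetical, and the binomial draw is a naive sum of Bernoulli trials standing in for rv.binom.

```python
import random


def binomial_draw(rng, trials, p):
    """Naive binomial variate: count successes in `trials` Bernoulli draws."""
    return sum(1 for _ in range(trials) if rng.random() < p)


def bootstrap_frequency_weights(freq, rng):
    """Simulated frequency weights for one bootstrap sample."""
    remaining_draws = sum(freq)   # N minus weights already simulated
    remaining_mass = sum(freq)    # N minus original weights already visited
    simulated = []
    for f_k in freq:
        p = f_k / remaining_mass if remaining_mass > 0 else 0.0
        draw = binomial_draw(rng, remaining_draws, p)
        simulated.append(draw)
        remaining_draws -= draw
        remaining_mass -= f_k
    return simulated


weights = bootstrap_frequency_weights([2, 1, 3, 4], random.Random(0))
```

Because the last record is drawn with probability 1 from whatever remains, the simulated weights always sum to N, so each replicate has the same size as the original dataset.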


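Scoring a Continuous Target


■ Mean

$$\hat y_k = \frac{1}{M} \sum_{m=1}^{M} \hat y_k^m$$

■ Median

Sort $\hat y_k^m$ and relabel them $\hat y_{(1)} \le \ldots \le \hat y_{(M)}$

$$\hat y_k = \begin{cases} \hat y_{((M+1)/2)} & \text{if M is odd} \\ \frac{1}{2} \left( \hat y_{(M/2)} + \hat y_{(M/2+1)} \right) & \text{if M is even} \end{cases}$$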


Scoring a Categorical Target


■ Voting

$$\hat y_k = \arg\max_{i \in \Omega} \frac{1}{|M_{ki}|} \sum_{m \in M_{ki}} p_i^m(x_k)$$

$$\hat p_k = \frac{1}{|M_{k \hat y_k}|} \sum_{m \in M_{k \hat y_k}} p_{\hat y_k}^m(x_k)$$

where $\Omega = \{\arg\max_i |M_{ki}|\}$

■ Highest probability

$$\hat y_k = \arg\max_i \left( \max_m p_i^m(x_k) \right)$$

$$\hat p_k = \max_m p_{\hat y_k}^m(x_k)$$

■ Highest mean probability

$$\hat y_k = \arg\max_i \frac{1}{M} \sum_{m=1}^{M} p_i^m(x_k)$$

$$\hat p_k = \frac{1}{M} \sum_{m=1}^{M} p_{\hat y_k}^m(x_k)$$


Bagging Model Measures


Accuracy 


Accuracy is computed for the naive model, reference (simple) model, ensemble model (associated 
with each ensemble method), and base models. 


For categorical targets, the classification accuracy is

$$\frac{1}{N} \sum_{k=1}^{K} f_k I(y_k == \hat y_k)$$

For continuous targets, it is

$$R^2 = 1 - \frac{\sum_{k=1}^{K} f_k (y_k - \hat y_k)^2}{\sum_{k=1}^{K} f_k (y_k - \bar y)^2}$$

where $\bar y = \frac{1}{N} \sum_{k=1}^{K} f_k y_k$.
Note that $R^2$ can never be greater than one, but can be less than zero.


For the naive model, $\hat y_k$ is the modal category for categorical targets and the mean for continuous
targets.
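Both accuracy measures are frequency-weighted and can be sketched in a few lines. The function names and the sample data are hypothetical.

```python
def classification_accuracy(y, y_hat, freq):
    """Frequency-weighted share of correctly classified records."""
    total = sum(freq)
    correct = sum(f for actual, predicted, f in zip(y, y_hat, freq)
                  if actual == predicted)
    return correct / total


def r_squared(y, y_hat, freq):
    """Frequency-weighted R-squared; can be negative, never above one."""
    total = sum(freq)
    y_bar = sum(f * actual for actual, f in zip(y, freq)) / total
    sse = sum(f * (actual - predicted) ** 2
              for actual, predicted, f in zip(y, y_hat, freq))
    sst = sum(f * (actual - y_bar) ** 2 for actual, f in zip(y, freq))
    return 1 - sse / sst


acc = classification_accuracy(["a", "b", "a"], ["a", "a", "a"], [1, 2, 1])
worse_than_mean = r_squared([1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [1, 1, 1])
```

Predicting the mean for every record gives an R² of zero, and predictions that are worse than the mean, as in the second call, give a negative value.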


Diversity 


Diversity is a range measure between 0 and 1 in the larger-is-more-diverse form. It shows how 
much predictions vary across base models. 


For categorical targets, diversity is 


ik 
1 : 
Nor So fel (yx) [M — L (yn)] 


where $L(y_k) = \sum_{m=1}^{M} I(y_k = \hat y_k^m)$.


For continuous targets, diversity is 


fl j wu 53 - (Ye — Ue) (Ge — Ye) 


D— td, nz=m 


2 
bare ite (Yk — Ur) 


Adaptive Boosting


Adaptive boosting (AdaBoost) is an algorithm used to boost models with continuous targets 
(Freund and Schapire 1996, Drucker 1997). 


1. Initialize values.

$$\text{Set } w_k = \begin{cases} \dfrac{w_k}{\sum_{j=1}^{K} w_j f_j} & \text{if analysis weights specified} \\ 1/N & \text{otherwise} \end{cases}$$

Set m = 1, $w_k^m = w_k$, and $f_k^m = f_k$. Note that analysis weights are initialized even if the
method used to build base models does not support analysis weights.


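2. Build base model m, $T^m(\cdot)$, using the training set and score the training set.

Set the model weight for base model m,

$$\omega^m = \log \frac{1 - \sum_{k=1}^{K} L_k w_k^m f_k}{\sum_{k=1}^{K} L_k w_k^m f_k}$$

where $L_k = \dfrac{\text{abs}(\hat y_k^m - y_k)}{\max_j \left( \text{abs}(\hat y_j^m - y_j) \right)}$.
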
3. Set weights for the next base model.

$$w_k^{m+1} = \frac{a_k^{m+1}}{\sum_{j=1}^{K} a_j^{m+1} f_j}$$

where $a_k^{m+1} = w_k^m \left( \dfrac{\sum_{k=1}^{K} L_k w_k^m f_k}{1 - \sum_{k=1}^{K} L_k w_k^m f_k} \right)^{1 - L_k}$. Note that analysis weights are always updated. If
the method used to build base models does not support analysis weights, the frequency weights
are updated for the next base model as follows:

$$f_k^{m+1} = \begin{cases} \text{rv.binom}\left( N, \; w_k^{m+1} f_k \right) & k = 1 \\[2ex] \text{rv.binom}\left( N - \sum_{i=1}^{k-1} f_i^{m+1}, \; \dfrac{w_k^{m+1} f_k}{1 - \sum_{i=1}^{k-1} w_i^{m+1} f_i} \right) & \text{otherwise} \end{cases}$$

If m < M, set m = m + 1 and go to step 2. Otherwise, the ensemble model is complete.

Note: base models where $\sum_{k=1}^{K} L_k w_k^m f_k > 0.5$ or $\max_j \left( \text{abs}(\hat y_j^m - y_j) \right) = 0$ are removed from the
ensemble.
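One boosting iteration for a continuous target can be sketched as follows, assuming the linear loss and the weight update of Drucker's AdaBoost.R2 as described in the steps above. The function name is hypothetical, and the sketch returns None when the base model would be removed from the ensemble.

```python
import math


def adaboost_r2_step(y, y_hat, w, f):
    """One AdaBoost iteration for a continuous target.

    Returns (model_weight, new_analysis_weights), or None when the base
    model would be dropped from the ensemble.
    """
    abs_err = [abs(p - a) for p, a in zip(y_hat, y)]
    max_err = max(abs_err)
    if max_err == 0:
        return None                              # perfect fit: dropped
    loss = [e / max_err for e in abs_err]        # L_k in [0, 1]
    avg_loss = sum(l * wk * fk for l, wk, fk in zip(loss, w, f))
    if avg_loss > 0.5:
        return None                              # too weak: dropped
    beta = avg_loss / (1 - avg_loss)
    model_weight = math.log(1 / beta)
    # Well-predicted records (small L_k) are down-weighted the most.
    raw = [wk * beta ** (1 - l) for wk, l in zip(w, loss)]
    norm = sum(r * fk for r, fk in zip(raw, f))
    return model_weight, [r / norm for r in raw]


step = adaboost_r2_step([0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0],
                        [0.25] * 4, [1] * 4)
```

In this example only the fourth record is mispredicted, so its analysis weight rises from 1/4 to 1/2 while the others fall to 1/6, and the model weight is log 3.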

Scoring 


AdaBoost uses the weighted median method to score the ensemble model.

Sort $\hat y_k^m$ and relabel them $\hat y_{(1)} \le \ldots \le \hat y_{(M)}$, retaining the association of the model weights, $\omega^m$,
and relabeling them $\omega^{(1)}, \ldots, \omega^{(M)}$.

The ensemble predicted value is then $\hat y_k = \hat y_{(i)}$, where i is the value such that

$$\sum_{m=1}^{i-1} \omega^{(m)} < \frac{1}{2} \sum_{m=1}^{M} \omega^{(m)} \le \sum_{m=1}^{i} \omega^{(m)}$$

Stagewise Additive Modeling using Multiclass Exponential loss 


Stagewise Additive Modeling using a Multiclass Exponential loss function (SAMME) is an 
algorithm that extends the original AdaBoost algorithm to categorical targets. 


1. Initialize values.

$$\text{Set } w_k = \begin{cases} \dfrac{w_k}{\sum_{j=1}^{K} w_j f_j} & \text{if analysis weights specified} \\ 1/N & \text{otherwise} \end{cases}$$

Set m = 1, $w_k^m = w_k$, and $f_k^m = f_k$. Note that analysis weights are initialized even if the
method used to build base models does not support analysis weights.


2. Build base model m, $T^m(\cdot)$, using the training set and score the training set.

Set the model weight for base model m,

$$\omega^m = \log \frac{1 - err_m}{err_m} + \log(C - 1)$$

where $err_m = \sum_{k=1}^{K} w_k^m f_k I(y_k \ne \hat y_k^m)$.


3. Set weights for the next base model.

$$w_k^{m+1} = \frac{a_k^{m+1}}{\sum_{j=1}^{K} a_j^{m+1} f_j}$$

where $a_k^{m+1} = w_k^m \exp\left( \omega^m I(y_k \ne \hat y_k^m) \right)$. Note that analysis weights are always updated. If the
method used to build base models does not support analysis weights, the frequency weights are
updated for the next base model as follows:

$$f_k^{m+1} = \begin{cases} \text{rv.binom}\left( N, \; w_k^{m+1} f_k \right) & k = 1 \\[2ex] \text{rv.binom}\left( N - \sum_{i=1}^{k-1} f_i^{m+1}, \; \dfrac{w_k^{m+1} f_k}{1 - \sum_{i=1}^{k-1} w_i^{m+1} f_i} \right) & \text{otherwise} \end{cases}$$

If m < M, set m = m + 1 and go to step 2. Otherwise, the ensemble model is complete.


Note: base models where $err_m = 0$ or $\omega^m \le 0$ are removed from the ensemble.
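A single SAMME iteration can be sketched in the same way as the continuous case. The function name is hypothetical; the sketch returns None when the base model would be removed, and it also treats an error of 1 as removal, since the model weight would then be negative infinity.

```python
import math


def samme_step(y, y_hat, w, f, num_categories):
    """One SAMME iteration for a categorical target.

    Returns (model_weight, new_analysis_weights), or None when the base
    model would be dropped from the ensemble.
    """
    miss = [1 if actual != predicted else 0
            for actual, predicted in zip(y, y_hat)]
    err = sum(wk * fk * m for wk, fk, m in zip(w, f, miss))
    if err == 0 or err >= 1:
        return None
    model_weight = math.log((1 - err) / err) + math.log(num_categories - 1)
    if model_weight <= 0:
        return None
    # Only misclassified records have their weights increased.
    raw = [wk * math.exp(model_weight * m) for wk, m in zip(w, miss)]
    norm = sum(r * fk for r, fk in zip(raw, f))
    return model_weight, [r / norm for r in raw]


step = samme_step(["a", "a", "b", "b"], ["a", "a", "b", "a"],
                  [0.25] * 4, [1] * 4, num_categories=2)
```

With two categories the log(C − 1) term vanishes and the update coincides with the original AdaBoost: the misclassified record's weight rises to 1/2 and the others fall to 1/6.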


Scoring 


SAMME uses the weighted majority vote method to score the ensemble model.


The predicted value of the kth record for the mth base model is $\hat y_k^m = \arg\max_i p_i^m(x_k)$.

The ensemble predicted value is then $\hat y_k = \arg\max_i \sum_{m=1}^{M} \omega^m I(\hat y_k^m == i)$. Ties are resolved at
random.

The ensemble predicted probability is $\hat p_k = \dfrac{\sum_{m=1}^{M} \omega^m p_{\hat y_k}^m(x_k)}{\sum_{m=1}^{M} \omega^m}$.
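The weighted majority vote for a single record can be sketched as follows. The function name is hypothetical, and ties are broken with Python's random module as one possible way of resolving them at random.

```python
import random


def weighted_majority_vote(predictions, model_weights, rng=None):
    """Ensemble category for one record from base-model predictions."""
    totals = {}
    for label, weight in zip(predictions, model_weights):
        totals[label] = totals.get(label, 0.0) + weight
    best = max(totals.values())
    winners = sorted(label for label, total in totals.items()
                     if total == best)
    if len(winners) == 1:
        return winners[0]
    return (rng or random).choice(winners)   # tie resolved at random


label = weighted_majority_vote(["a", "b", "b"], [3.0, 1.0, 1.0])
```

A single heavily weighted model can outvote several lightly weighted ones, as in the example, where category "a" wins with a total weight of 3 against 2.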


Boosting Model Measures 


Accuracy 


Accuracy is computed for the naive model, reference (simple) model, ensemble model (associated 
with each ensemble method), and base models. 


For categorical targets, the classification accuracy is

$$\frac{1}{N} \sum_{k=1}^{K} f_k I(y_k == \hat y_k)$$

For continuous targets, it is

$$R^2 = 1 - \frac{\sum_{k=1}^{K} f_k (y_k - \hat y_k)^2}{\sum_{k=1}^{K} f_k (y_k - \bar y)^2}$$

where $\bar y = \frac{1}{N} \sum_{k=1}^{K} f_k y_k$.
Note that $R^2$ can never be greater than one, but can be less than zero.


For the naive model, $\hat y_k$ is the modal category for categorical targets and the mean for continuous
targets.


Very large datasets (pass, stream, merge) algorithms


We implement the PSM features PASS, STREAM, and MERGE through ensemble modeling.
PASS builds models on very large data sets with only one data pass; STREAM updates the
existing model with new cases without the need to store or recall the old training data; MERGE
builds models in a distributed environment and merges the built models into one model.

In an ensemble model, the training set will be divided into subsets called blocks, and a model will
be built on each block. Because the blocks may be dispatched to different threads (here one process 
contains one thread) and even different machines, models in different processes can be built at the 
same time. As new data blocks arrive, the algorithm simply repeats this procedure. Therefore it 
can easily handle the data stream and perform incremental learning for ensemble modeling. 


Pass 


The PASS operation includes the following steps:


1. Split the data into training blocks, a testing set and a holdout set. Note that the frequency weight, 
if specified, is ignored when splitting the training set into blocks (to prevent blocks from being 
entirely represented by a single case) but is accounted for when creating the testing and holdout 
sets. 


2. Build base models on training blocks and build a reference model on the testing set. A single 
model is built on the testing set and each training block. 


3. Evaluate each base model by computing the accuracy based on the testing set. Select a subset 
of base models as ensemble elements according to accuracy. 


4. Evaluate the ensemble model and the reference model by computing the accuracy based on 
the holdout set. If the ensemble model’s performance is not better than the reference model’s 
performance on the holdout set, we use the reference model to score the new cases. 


Computing Model Accuracy 


The accuracy of a base model is assessed on the testing set. For each vector of predictors $x_i$ and
the corresponding label $c_i$ observed in the testing set $T$, let $\hat{c}(x_i)$ be the label predicted by the
given model. Then the testing error is estimated as:

Categorical target.

$$E = \frac{1}{\sum_{i=1}^{|T|} f_i} \sum_{i=1}^{|T|} f_i \cdot I\left(c_i \neq \hat{c}(x_i)\right)$$

Continuous target.

$$E = \frac{1}{\sum_{i=1}^{|T|} f_i} \sum_{i=1}^{|T|} f_i \cdot |y_i - \hat{y}_i|$$

where $I(c_i \neq \hat{c}(x_i))$ is 1 if $c_i \neq \hat{c}(x_i)$ and 0 otherwise.


The accuracy for the given model is computed by A = 1 - E. The accuracy for the whole ensemble
model and the reference model is assessed on the holdout set.
Stream


When new cases arrive and the user wants to update the existing ensemble model with these 
cases, the algorithm will: 


1. Start a PASS operation to build an ensemble model on the new data, then 


2. MERGE the newly created ensemble model and the existing ensemble model. 


Merge 


The MERGE operation has the following steps: 


1. Merge the holdout sets into a single holdout set and, if necessary, reduce this set to a reasonable 
size. 


2. Merge the testing sets into a single testing set and, if necessary, reduce this set to a reasonable
size.


3. Build a merged reference model on the merged testing set.


4. Evaluate every base model by computing the accuracy based on the merged testing set. Select a 
subset of base models as elements of the merged ensemble model according to accuracy. 


5. Evaluate the merged ensemble model and the merged reference model by computing the accuracy 
based on the merged holdout set. 


Adaptive Predictor Selection 


There are two methods, depending upon whether the method used to build base models has an 
internal predictor selection algorithm. 


Method has predictor selection algorithm 


The first base model is built with all predictors available to the method’s predictor selection
algorithm. Base model j (j > 1) makes the ith predictor available with probability

$$p_i = \max\left(\frac{n_i^{+} + C}{n_i + C},\ \beta\right)$$

where $n_i^{+}$ is the number of times the ith predictor was selected by the method’s predictor selection
algorithm in the previous j-1 base models, $n_i$ is the number of times the ith predictor was made
available to the method’s predictor selection algorithm in the previous j-1 base models, C is a
constant to smooth the value of $p_i$, and $\beta$ is a lower limit on $p_i$.
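
A minimal sketch of the availability rule $p_i = \max((n_i^{+} + C)/(n_i + C), \beta)$ follows. The function names are hypothetical, and the values of C and $\beta$ are not given in this section, so they are left as required arguments rather than defaults.

```python
import random


def availability_probability(n_selected, n_available, c, beta):
    """p_i = max((n_i^+ + C) / (n_i + C), beta)."""
    return max((n_selected + c) / (n_available + c), beta)


def draw_available_predictors(selected_counts, available_counts, c, beta, rng=None):
    """Return the indices of predictors offered to the next base model.

    selected_counts[i]  -- times predictor i was selected in earlier base models
    available_counts[i] -- times predictor i was made available in earlier base models
    """
    rng = rng or random.Random()
    return [
        i
        for i, (n_sel, n_avail) in enumerate(zip(selected_counts, available_counts))
        if rng.random() < availability_probability(n_sel, n_avail, c, beta)
    ]
```

A predictor that was selected every time it was offered keeps probability 1, while one that was never selected decays toward the floor $\beta$ as $n_i$ grows.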


Method does not have predictor selection algorithm 


Each base model makes the ith predictor available with probability 


p= £— pi)” if pi < 0.05 
B otherwise 


where $p_i$ is the p-value of a test for the ith predictor, as defined below.


■ For a categorical target and categorical predictor, p is a chi-square test of
$G^2 = 2\sum_{i=1}^{I}\sum_{j=1}^{J} G_{ij}^2$, where $G_{ij}^2 = N_{ij}\ln(N_{ij}/\hat{N}_{ij})$ if $N_{ij} > 0$ and 0 otherwise, with degrees
of freedom $(I-1)(J-1)$. $N_{ij}$ is the number of cases with X=i and Y=j, $N_{i\cdot} = \sum_{j=1}^{J} N_{ij}$,
$N_{\cdot j} = \sum_{i=1}^{I} N_{ij}$, and $\hat{N}_{ij} = N_{i\cdot}N_{\cdot j}/N$.

■ For a categorical target and continuous predictor, p is an F test of

$$F = \frac{\sum_{j=1}^{J} N_j(\bar{x}_j - \bar{x})^2/(J-1)}{\sum_{j=1}^{J} (N_j - 1)s_j^2/(N-J)}$$

with degrees of freedom $J-1, N-J$. $N_j$ is the number of cases with Y=j, $\bar{x}_j$ and $s_j^2$ are the
sample mean and sample variance of X given Y=j, and $\bar{x} = \sum_{j=1}^{J} N_j\bar{x}_j/N$.

■ For a continuous target and categorical predictor, p is an F test of

$$F = \frac{\sum_{i=1}^{I} N_i(\bar{y}_i - \bar{y})^2/(I-1)}{\sum_{i=1}^{I} (N_i - 1)s(y)_i^2/(N-I)}$$

with degrees of freedom $I-1, N-I$. $N_i$ is the number of cases with X=i, $\bar{y}_i$ and $s(y)_i^2$ are the
sample mean and sample variance of Y given X=i, and $\bar{y} = \sum_{i=1}^{I} N_i\bar{y}_i/N$.

■ For a continuous target and continuous predictor, p is a two-sided t test of
$t = r\sqrt{\frac{N-2}{1-r^2}}$, where

$$r = \frac{\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})/(N-1)}{\sqrt{s(x)^2 s(y)^2}}$$

with degrees of freedom $N-2$. $s(x)^2$ is the sample variance of X and $s(y)^2$ is the sample
variance of Y.
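
Two of these test statistics are simple enough to sketch with the standard library alone: the likelihood-ratio statistic $G^2$ for two categorical variables and the t statistic for two continuous variables. The function names are hypothetical, and the sketch stops at the statistic and its degrees of freedom; converting to a p-value needs a chi-square or t distribution function, which is not shown here.

```python
import math


def likelihood_ratio_chi_square(table):
    """Return (G2, df) for an I x J table of counts N_ij (rows = X, columns = Y)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    g2 = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            if n_ij > 0:  # empty cells contribute 0
                expected = row_totals[i] * col_totals[j] / n
                g2 += n_ij * math.log(n_ij / expected)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return 2.0 * g2, df


def correlation_t_statistic(x, y):
    """Return (t, df) with t = r * sqrt((N - 2) / (1 - r^2)); undefined when |r| = 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    var_x = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    r = cov / math.sqrt(var_x * var_y)
    return r * math.sqrt((n - 2) / (1 - r * r)), n - 2
```

An independent table such as `[[10, 10], [10, 10]]` gives $G^2 = 0$ with 1 degree of freedom, which is the case where the predictor carries no information about the target.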


Automatic Category Balancing 


When a target category occurs relatively infrequently, many models do a poor job of predicting
members of that rarely occurring category, even if the overall prediction rate of the model is fairly
good. Automatic category balancing should improve the model’s accuracy when predicting
infrequently occurring values.


As records arrive, they are added to a training block until it is full. Then the proportion of records
in each category is computed: $C_i = w_i / w$, where $w_i$ is the weighted number of records taking
category i and w is the total weighted number of records.


■ If there is any category such that $C_i < \alpha/(10 \cdot |C|)$, where $|C|$ is the number of target categories
and $\alpha = 0.3$, then randomly remove each record from the training block with probability


Min {( — Min(C)/C,), (1 2 er) 


This operation will tend to remove records from frequently occurring categories. Add new records
to the training block until it is full again, and repeat this step until the condition is not satisfied.


■ If there is any category such that $C_i < \alpha/|C|$, then recompute the frequency weight for record k as
$f_k^{*} = f_k \max\left(10,\ \alpha\max(C)/C_{c(k)}\right)$, where $c(k)$ is the category of the kth record. This operation
gives greater weight to infrequently occurring categories.
Model Measures


The following notation applies.


$N$    Total number of records
$M$    Total number of base models
$f_k$    The frequency weight of record k
$y_k$    The observed target value of record k
$\hat{y}_k$    The predicted target value of record k by the ensemble model
$\hat{y}_k^m$    The predicted target value of record k by base model m


Accuracy


Accuracy is computed for the naive model, reference (simple) model, ensemble model (associated 
with each ensemble method), and base models. 


For categorical targets, the classification accuracy is

$$\frac{1}{N}\sum_{k=1}^{K} f_k I(y_k == \hat{y}_k)$$

where

$$I(y_k == \hat{y}_k) = \begin{cases} 1 & \text{if } y_k == \hat{y}_k \\ 0 & \text{otherwise} \end{cases}$$

For continuous targets, it is

$$R^2 = 1 - \frac{\sum_{k=1}^{K} f_k (y_k - \hat{y}_k)^2}{\sum_{k=1}^{K} f_k (y_k - \bar{y})^2}$$

where $\bar{y} = \frac{1}{N}\sum_{k=1}^{K} f_k y_k$.

Note that $R^2$ can never be greater than one, but can be less than zero.


For the naive model, $\hat{y}_k$ is the modal category for categorical targets and the mean for continuous
targets.
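
The two accuracy measures can be sketched as follows. This is an illustration under one stated assumption: N is taken to be the sum of the frequency weights $f_k$, which is what makes the categorical accuracy a proportion. The function names are hypothetical.

```python
def classification_accuracy(y, y_hat, f):
    """(1/N) * sum(f_k * I(y_k == y_hat_k)), with N assumed to be sum(f_k)."""
    n = sum(f)
    return sum(fk for yk, pk, fk in zip(y, y_hat, f) if yk == pk) / n


def r_squared(y, y_hat, f):
    """1 - sum(f_k (y_k - y_hat_k)^2) / sum(f_k (y_k - y_bar)^2)."""
    n = sum(f)
    y_bar = sum(fk * yk for yk, fk in zip(y, f)) / n
    sse = sum(fk * (yk - pk) ** 2 for yk, pk, fk in zip(y, y_hat, f))
    sst = sum(fk * (yk - y_bar) ** 2 for yk, fk in zip(y, f))
    return 1.0 - sse / sst
```

Predicting the weighted mean for every record (the naive model for continuous targets) gives $R^2 = 0$, and a model that does worse than the mean gives a negative value, which is why the measure is bounded above by one but not below by zero.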


Diversity 


Diversity is a range measure between 0 and 1 in the larger-is-more-diverse form. It shows how 
much predictions vary across base models. 


For categorical targets, diversity is 


1 K 


VAP >» fel (ye) (M — £ (yr) 
° - k=1 


where $l(y_k) = \sum_{m=1}^{M} I(y_k = \hat{y}_k^m)$ and $I(y_k = \hat{y}_k^m)$ is defined as above.


Diversity is not available for continuous targets. 


Scoring 


There are several strategies for scoring using the ensemble models. 


Continuous Target 


Mean. $\hat{y}_{i,PSM} = \frac{1}{M}\sum_{m=1}^{M} \hat{y}_{im}$


Median. $\hat{y}_{i,PSM} = \mathrm{Median}_{m=1}^{M}(\hat{y}_{im})$


where $\hat{y}_{i,PSM}$ is the final predicted value of case i, and $\hat{y}_{im}$ is the mth base model’s predicted
value of case i.


Categorical Target 


Voting. Assume that $d_{mk}$ represents the label output of the mth base model for a given vector
of predictor values. $d_{mk} = 1$ if the label assigned by the mth base model is the kth target category
and 0 otherwise. There are a total of M base models and K target categories. The majority vote
method selects the jth category if it is assigned by the plurality of base models. It satisfies the
following equation:

$$\sum_{m=1}^{M} d_{mj} = \max_{k=1}^{K}\left(\sum_{m=1}^{M} d_{mk}\right)$$

Let $E_m$ be the testing error estimated for the mth base model. Weights for the weighted majority
vote are then computed according to the following expression:

$$w_m = \max\left(\log\frac{1-E_m}{E_m},\ 0\right) \Big/ \sum_{l=1}^{M}\max\left(\log\frac{1-E_l}{E_l},\ 0\right)$$


Probability voting. Assume that $p_{mk}$ is the posterior probability estimated for the kth target
category by the mth base model for a given vector of predictor values. The following rules
combine the probabilities computed by the base models. The jth category is selected such that it
satisfies the corresponding equation.


■ Highest probability. $\max_{m=1}^{M}(p_{mj}) = \max_{k=1}^{K}\left(\max_{m=1}^{M}(p_{mk})\right)$

■ Highest mean probability. $\frac{1}{M}\sum_{m=1}^{M} p_{mj} = \max_{k=1}^{K}\left(\frac{1}{M}\sum_{m=1}^{M} p_{mk}\right)$

Ties are resolved at random.
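
The voting rules can be sketched as below. This is an illustration with hypothetical function names. It assumes every testing error lies strictly between 0 and 1 and that at least one base model has an error below 0.5 (otherwise all weights would be zero). Each function returns the list of tied winners so that the caller can break ties at random.

```python
import math
from collections import defaultdict


def vote_weights(testing_errors):
    """w_m = max(log((1 - E_m) / E_m), 0), normalized to sum to 1."""
    raw = [max(math.log((1.0 - e) / e), 0.0) for e in testing_errors]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_vote(labels, weights=None):
    """Return the categories with the largest (weighted) vote; plain voting if weights is None."""
    if weights is None:
        weights = [1.0] * len(labels)
    tally = defaultdict(float)
    for label, w in zip(labels, weights):
        tally[label] += w
    best = max(tally.values())
    return sorted(label for label, score in tally.items() if score == best)


def highest_mean_probability(probabilities):
    """probabilities[m][k] is p_mk; return the category indices with the largest mean."""
    n_models = len(probabilities)
    means = [sum(col) / n_models for col in zip(*probabilities)]
    best = max(means)
    return [k for k, value in enumerate(means) if value == best]
```

A base model with a testing error of 0.5 or worse receives weight 0, so it has no influence on the weighted vote.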


Softmax smoothing. The softmax function can be used for smoothing the probabilities:

$$p_i' = \frac{\exp(p_i)}{\sum_{j}\exp(p_j)}$$

where the sum runs over the target categories, $p_i$ is the rule-based confidence for category i, and
$p_i'$ is the smoothed value.
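
A short sketch of the smoothing step, with a hypothetical function name. Subtracting the largest confidence before exponentiating is a numerical-stability choice added here; it cancels in the ratio and does not change the result.

```python
import math


def softmax(confidences):
    """Return exp(p_i) / sum_j exp(p_j) for each category confidence p_i."""
    shift = max(confidences)  # cancels in the ratio; avoids overflow for large inputs
    exps = [math.exp(p - shift) for p in confidences]
    total = sum(exps)
    return [e / total for e in exps]
```

The smoothed values are positive, sum to one, and preserve the ordering of the original confidences.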


ERROR BARS Algorithms 


This section describes the algorithms for error bar computation of the mean, median and their 
confidence intervals for a simple random sample. 


Notation 
The following notation is used throughout this section unless otherwise noted: 


Let $y_1 < \ldots < y_m$ be m ordered observations for the sample and $w_1, \ldots, w_m$ be the corresponding
case weights. Then

$$ww_i = \sum_{k=1}^{i} w_k = \text{cumulative sum of weights up to and including } y_i$$

and

$$W = ww_m = \sum_{k=1}^{m} w_k = \text{total sum of weights.}$$

$p = \frac{CI}{100}$, where CI is the confidence interval level, $0 < CI < 100$.


Descriptive Statistics 


The following statistics are available. 


Mean 


Confidence Interval for the Mean

Lower bound = $\bar{y} - \mathrm{IDF.T}\left(\frac{1+p}{2},\ W-1\right)\cdot SE$
Upper bound = $\bar{y} + \mathrm{IDF.T}\left(\frac{1+p}{2},\ W-1\right)\cdot SE$


where SE is the standard error, and IDF.T is the inverse Student t function documented in the
COMPUTE command.


Variance


Standard Deviation


Standard Error

$$SE = \frac{s}{\sqrt{W}}$$


Median


The Aempirical method in the EXAMINE procedure is used for computation of the median.


Let

$$v = W/2$$

and k satisfies

$$ww_k \le v < ww_{k+1}$$

Then,

$$g = v - ww_k$$

Let $\hat{m}$ be the estimated median; then it is defined as

$$\hat{m} = \begin{cases} (y_k + y_{k+1})/2, & g = 0 \\ y_{k+1}, & g > 0 \end{cases}$$


Confidence Interval for the Median 


Note: the case weights $w_1, \ldots, w_m$ must be integers for the following computation. If at least one
weight is not an integer, an error message is issued.

Let

$$b_i = \Pr[\mathrm{Binomial}(W, 0.5) \ge i] = \sum_{j=i}^{W}\binom{W}{j}(0.5)^{W} = IB(0.5;\ i,\ W-i+1)$$

where IB is the incomplete Beta function.


Define

$$\gamma_i = \Pr[i \le \mathrm{Binomial}(W, 0.5) \le W-i] = b_i - b_{W-i+1}, \quad i = 0, 1, \ldots, \mathrm{floor}(W/2)$$

and define

$\gamma_{W/2+1} = 0$, if W is even;

$\gamma_{(W+1)/2} = 0$, if W is odd.


Algorithm: Hettmansperger-Sheather Interpolation (1986) 


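1. Re-index all the cases to be $x_1 \le x_2 \le \ldots \le x_W$ in which
$x_1 = x_2 = \ldots = x_{ww_1} = y_1$
$x_{ww_1+1} = x_{ww_1+2} = \ldots = x_{ww_2} = y_2$
...
$x_{ww_{m-1}+1} = x_{ww_{m-1}+2} = \ldots = x_{ww_m} = y_m$

2. If W is even, compute $\gamma_1, \ldots, \gamma_{W/2}$.
If W is odd, compute $\gamma_1, \ldots, \gamma_{(W-1)/2}$.


3. Choose the smallest index k such that $\gamma_{k+1} < p$. If k is found, go to Step 4; otherwise, stop
and issue a message.


4. Compute

$$I = \frac{\gamma_k - p}{\gamma_k - \gamma_{k+1}}$$

and

$$\lambda = \frac{(W-k)\,I}{k + (W-2k)\,I}$$

The p confidence interval is
Lower bound = $\lambda \cdot x_{k+1} + (1-\lambda)\cdot x_k$


Upper bound = $\lambda \cdot x_{W-k} + (1-\lambda)\cdot x_{W-k+1}$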
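
The interpolation steps can be sketched with the standard library. This is an illustration, not the product code: the function names are hypothetical, the binomial tail is computed by direct summation rather than through the incomplete Beta function, and the "k not found" case is interpreted here as the situation where even the widest interval ($\gamma_1$) falls short of the requested level.

```python
from math import comb


def upper_tail(w, i):
    """b_i = Pr[Binomial(W, 0.5) >= i]."""
    if i <= 0:
        return 1.0
    if i > w:
        return 0.0
    return sum(comb(w, j) for j in range(i, w + 1)) / 2 ** w


def coverage(w, i):
    """gamma_i = b_i - b_{W-i+1}; defined as 0 once i passes the middle of the sample."""
    if 2 * i > w:
        return 0.0
    return upper_tail(w, i) - upper_tail(w, w - i + 1)


def median_confidence_interval(values, weights, ci_level):
    """Return (lower, upper), or None when no suitable k exists."""
    if any(int(wt) != wt for wt in weights):
        raise ValueError("case weights must be integers")
    p = ci_level / 100.0
    # Step 1: expand each observation according to its integer weight, in sorted order.
    x = [y for y, wt in sorted(zip(values, weights)) for _ in range(int(wt))]
    w = len(x)
    # Steps 2-3: smallest k with gamma_{k+1} < p.
    for k in range(1, w // 2 + 1):
        gamma_k, gamma_next = coverage(w, k), coverage(w, k + 1)
        if gamma_next < p:
            break
    else:
        return None
    if gamma_k < p:  # assumption: the sample is too small for the requested level
        return None
    # Step 4: interpolate between adjacent order statistics (x is 0-indexed here).
    interp = (gamma_k - p) / (gamma_k - gamma_next)
    lam = (w - k) * interp / (k + (w - 2 * k) * interp)
    lower = lam * x[k] + (1 - lam) * x[k - 1]
    upper = lam * x[w - k - 1] + (1 - lam) * x[w - k]
    return lower, upper
```

For the ten observations 1, ..., 10 with unit weights and CI = 95, the search stops at k = 2, and the bounds fall between the 2nd and 3rd and between the 8th and 9th order statistics, symmetric about 5.5.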
EXAMINE Algorithms


EXAMINE provides stem-and-leaf plots, histograms, boxplots, normal plots, robust estimates of 
location, tests of normality, and other descriptive statistics. Separate analyses can be obtained for 
subgroups of cases. 


Univariate Statistics 


This section discusses the computation of statistics for a variable considered on its own. 


Notation 
The following notation is used throughout this chapter unless otherwise noted: 


Let $y_1 < \ldots < y_m$ be m distinct ordered observations for the sample and $c_1, \ldots, c_m$ be the
corresponding caseweights. Then

$$cc_i = \sum_{k=1}^{i} c_k = \text{cumulative frequency up to and including } y_i$$

and

$$W = cc_m = \sum_{k=1}^{m} c_k = \text{total sum of weights.}$$
k=1 


Descriptive Statistics 


The following statistics are available. 


Minimum and Maximum 


$\min = y_1$, $\max = y_m$


Range


$\mathrm{range} = y_m - y_1$


Mean 
Confidence Interval for the Mean

lower bound = $\bar{y} - t_{\alpha/2,\,W-1}\,SE$
upper bound = $\bar{y} + t_{\alpha/2,\,W-1}\,SE$
where SE is the standard error.


Median 


The median is the 50th percentile, which is calculated by the method requested. The default 
method is HAVERAGE. 


Interquartile Range (IQR)


IQR = 75th percentile - 25th percentile, where the 75th and 25th percentiles are calculated
by the method requested for percentiles.


Variance

$$s^2 = \frac{1}{W-1}\sum_{i=1}^{m} c_i (y_i - \bar{y})^2$$


Standard Deviation

$$s = \sqrt{s^2}$$


Standard Error

$$SE = \frac{s}{\sqrt{W}}$$


Skewness and SE of Skewness

$$g_1 = \frac{W M_3}{(W-1)(W-2)s^3}$$

$$SE(g_1) = \sqrt{\frac{6W(W-1)}{(W-2)(W+1)(W+3)}}$$

where

$$M_3 = \sum_{i=1}^{m} c_i (y_i - \bar{y})^3$$


Kurtosis and SE of Kurtosis

$$g_2 = \frac{W(W+1)M_4 - 3M_2^2(W-1)}{(W-1)(W-2)(W-3)s^4}$$

where

$$M_2 = \sum_{i=1}^{m} c_i (y_i - \bar{y})^2$$

$$M_4 = \sum_{i=1}^{m} c_i (y_i - \bar{y})^4$$

$$SE(g_2) = \sqrt{\frac{4(W^2-1)\,SE^2(g_1)}{(W-3)(W+5)}}$$
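
The weighted skewness and kurtosis formulas translate directly into code. The sketch below uses a hypothetical function name and assumes W > 3 and a nonzero variance so that every denominator is defined.

```python
import math


def skewness_kurtosis(y, c):
    """Return (g1, SE(g1), g2, SE(g2)) for values y with caseweights c."""
    w = sum(c)
    mean = sum(ci * yi for yi, ci in zip(y, c)) / w
    m2 = sum(ci * (yi - mean) ** 2 for yi, ci in zip(y, c))
    m3 = sum(ci * (yi - mean) ** 3 for yi, ci in zip(y, c))
    m4 = sum(ci * (yi - mean) ** 4 for yi, ci in zip(y, c))
    s = math.sqrt(m2 / (w - 1))  # standard deviation
    g1 = w * m3 / ((w - 1) * (w - 2) * s ** 3)
    se_g1 = math.sqrt(6 * w * (w - 1) / ((w - 2) * (w + 1) * (w + 3)))
    g2 = (w * (w + 1) * m4 - 3 * m2 * m2 * (w - 1)) / ((w - 1) * (w - 2) * (w - 3) * s ** 4)
    se_g2 = math.sqrt(4 * (w * w - 1) * se_g1 ** 2 / ((w - 3) * (w + 5)))
    return g1, se_g1, g2, se_g2
```

For the symmetric sample 1, 2, 3, 4, 5 with unit weights, the skewness is 0 and the kurtosis is -1.2, with standard errors of about 0.913 and 2.0.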


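5% Trimmed Mean

$$T_{0.9} = \frac{1}{0.9W}\left[(cc_{k_1+1} - tc)\,y_{k_1+1} + (W - cc_{k_2-1} - tc)\,y_{k_2} + \sum_{i=k_1+2}^{k_2-1} c_i y_i\right]$$

where $k_1$ and $k_2$ satisfy the following conditions

$$cc_{k_1} < tc \le cc_{k_1+1}, \qquad W - cc_{k_2} < tc \le W - cc_{k_2-1}$$

and

$$tc = 0.05W$$

Note: If $k_1 + 1 = k_2$, then $T_{0.9} = y_{k_2}$.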


Percentiles 


There are five methods for computation of percentiles. Let

$$tc_1 = Wp, \qquad tc_2 = (W+1)p$$

where p is the requested percentile divided by 100, and $k_1$ and $k_2$ satisfy

$$cc_{k_1} \le tc_1 < cc_{k_1+1}$$
$$cc_{k_2} \le tc_2 < cc_{k_2+1}$$

Then,

$$g_1^* = tc_1 - cc_{k_1}, \qquad g_1 = \frac{tc_1 - cc_{k_1}}{c_{k_1+1}}$$

$$g_2^* = tc_2 - cc_{k_2}, \qquad g_2 = \frac{tc_2 - cc_{k_2}}{c_{k_2+1}}$$


Let x be the pth percentile; the five definitions are as follows: 
Waverage (Weighted Average) 

Round (Closest Observation) 

Empirical (Empirical Distribution Function) 

Haverage (Weighted Average) 


Aempirical (Empirical Distribution Function with Averaging) 


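Waverage (Weighted Average)


This is a weighted average at $y_{tc_1}$.

$$x = \begin{cases} y_{k_1+1} & \text{if } g_1^* \ge 1 \\ (1-g_1^*)\,y_{k_1} + g_1^*\,y_{k_1+1} & \text{if } g_1^* < 1 \text{ and } c_{k_1+1} \ge 1 \\ (1-g_1)\,y_{k_1} + g_1\,y_{k_1+1} & \text{if } g_1^* < 1 \text{ and } c_{k_1+1} < 1 \end{cases}$$


Round (Closest Observation)
This is the observation closest to $y_{tc_1}$.

If $c_{k_1+1} \ge 1$, then

$$x = \begin{cases} y_{k_1} & \text{if } g_1^* < \frac{1}{2} \\ y_{k_1+1} & \text{if } g_1^* \ge \frac{1}{2} \end{cases}$$

If $c_{k_1+1} < 1$, then

$$x = \begin{cases} y_{k_1} & \text{if } g_1 < \frac{1}{2} \\ y_{k_1+1} & \text{if } g_1 \ge \frac{1}{2} \end{cases}$$


Empirical (Empirical Distribution Function)

$$x = \begin{cases} y_{k_1} & \text{if } g_1^* = 0 \\ y_{k_1+1} & \text{if } g_1^* > 0 \end{cases}$$


Haverage (Weighted Average)
This is a weighted average at $y_{tc_2}$.

$$x = \begin{cases} y_{k_2+1} & \text{if } g_2^* \ge 1 \\ (1-g_2^*)\,y_{k_2} + g_2^*\,y_{k_2+1} & \text{if } g_2^* < 1 \text{ and } c_{k_2+1} \ge 1 \\ (1-g_2)\,y_{k_2} + g_2\,y_{k_2+1} & \text{if } g_2^* < 1 \text{ and } c_{k_2+1} < 1 \end{cases}$$


Aempirical (Empirical Distribution Function with Averaging)

$$x = \begin{cases} (y_{k_1} + y_{k_1+1})/2 & \text{if } g_1^* = 0 \\ y_{k_1+1} & \text{if } g_1^* > 0 \end{cases}$$


Note: If any of the 25th, 50th, or 75th percentiles is requested, Tukey Hinges will also be printed.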
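
The five percentile definitions share one lookup step, so they fit in a single function. The sketch below is an illustration with a hypothetical function name. The boundary handling is an assumption not stated in this section: $cc_0$ is taken as 0, $y_0$ as $y_1$, and any target at or beyond the total weight returns $y_m$.

```python
from bisect import bisect_right
from itertools import accumulate


def percentile(y, c, pct, method="haverage"):
    """Percentile of distinct sorted values y with caseweights c.

    method: "waverage", "round", "empirical", "haverage" or "aempirical".
    """
    p = pct / 100.0
    cc = list(accumulate(c))  # cc_1, ..., cc_m
    w = cc[-1]
    tc = (w + 1) * p if method == "haverage" else w * p
    k = bisect_right(cc, tc)  # cc_k <= tc < cc_{k+1}, with cc_0 = 0
    if k >= len(y):
        return y[-1]  # assumption: clamp at the largest observation
    y_k = y[max(k, 1) - 1]  # assumption: y_0 is taken as y_1
    y_next, c_next = y[k], c[k]  # y_{k+1}, c_{k+1}
    g_star = tc - (cc[k - 1] if k > 0 else 0.0)
    g = g_star / c_next
    frac = g_star if c_next >= 1 else g
    if method in ("haverage", "waverage"):
        return y_next if g_star >= 1 else (1 - frac) * y_k + frac * y_next
    if method == "round":
        return y_k if frac < 0.5 else y_next
    if method == "empirical":
        return y_k if g_star == 0 else y_next
    if method == "aempirical":
        return (y_k + y_next) / 2 if g_star == 0 else y_next
    raise ValueError("unknown method: " + method)
```

For the values 1, 2, 3, 4 with unit weights, the median is 2.5 under Haverage and Aempirical but 2 under Waverage, Round and Empirical, which shows how much the choice of definition matters in small samples.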


Tukey Hinges
Let $Q_1$, $Q_2$, and $Q_3$ be the 25th, 50th, and 75th percentiles. If $c^* \ge 1$, where $c^* = \min(c_1, \ldots, c_m)$,
define

$$d = \frac{\text{greatest integer} \le ((W+3)/2)}{2}$$

$$L_1 = d$$
$$L_2 = W/2 + 1/2$$
$$L_3 = W + 1 - d$$

Otherwise

$$d = \frac{\text{greatest integer} \le ((W/c^* + 3)/2)}{2}$$

and

$$L_1 = dc^*$$
$$L_2 = W/2 + c^*/2$$
$$L_3 = W + c^* - dc^*$$

Then for every i, i = 1, 2, 3, find $h_i$ such that

$$cc_{h_i} \le L_i < cc_{h_i+1}$$

and

$$Q_i = \begin{cases} (1-a_i^*)\,y_{h_i} + a_i^*\,y_{h_i+1} & \text{if } a_i^* < 1 \text{ and } c_{h_i+1} \ge 1 \\ (1-a_i)\,y_{h_i} + a_i\,y_{h_i+1} & \text{if } a_i^* < 1 \text{ and } c_{h_i+1} < 1 \\ y_{h_i+1} & \text{if } a_i^* \ge 1 \end{cases}$$

where

$$a_i^* = L_i - cc_{h_i}$$

$$a_i = \frac{a_i^*}{c_{h_i+1}}$$


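M-Estimation (Robust Location Estimation)
The M-estimator T of location is the solution of

$$\sum_{i=1}^{m} c_i \Psi\left(\frac{y_i - T}{s}\right) = 0$$

where $\Psi$ is an odd function and s is a measure of the spread.


An alternative form of M-estimation is

$$\sum_{i=1}^{m} c_i \left(\frac{y_i - T}{s}\right)\omega\left(\frac{y_i - T}{s}\right) = 0$$

where $\omega(u) = \Psi(u)/u$. After rearranging the above equation, we get

$$T = \frac{\sum_{i=1}^{m} c_i\, y_i\, \omega\left(\frac{y_i - T}{s}\right)}{\sum_{i=1}^{m} c_i\, \omega\left(\frac{y_i - T}{s}\right)}$$

This is solved iteratively: with $T_k$ denoting the estimate at iteration k, the right-hand side is
evaluated at $T_k$ to obtain $T_{k+1}$. The algorithm stops when either

$$|T_{k+1} - T_k| \le \varepsilon\,|(T_{k+1} + T_k)/2|, \text{ where } \varepsilon = 0.005$$

or the number of iterations exceeds 30.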


M-Estimators 


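Four M-estimators (Huber, Hampel, Andrew, and Tukey) are available. Let

$$u_i = \frac{y_i - T}{s}$$

where

s = median of $\tilde{y}_1, \ldots, \tilde{y}_m$ with caseweights $c_1, \ldots, c_m$

and

$\tilde{y}_i = |y_i - \tilde{y}|$, where $\tilde{y}$ is the median.


Huber (k), k > 0

$$\omega(u_i) = \begin{cases} 1 & \text{if } |u_i| \le k \\ \frac{k}{u_i}\,\mathrm{sgn}(u_i) & \text{if } |u_i| > k \end{cases}$$

The default value of k = 1.339.


Hampel (a, b, c), $0 < a \le b \le c$

$$\omega(u_i) = \begin{cases} 1 & \text{if } |u_i| \le a \\ \frac{a}{u_i}\,\mathrm{sgn}(u_i) & \text{if } a < |u_i| \le b \\ \frac{a}{u_i}\,\frac{c - |u_i|}{c - b}\,\mathrm{sgn}(u_i) & \text{if } b < |u_i| \le c \\ 0 & \text{if } |u_i| > c \end{cases}$$

By default, a = 1.7, b = 3.4 and c = 8.5.


Andrew’s Wave (c), c > 0

$$\omega(u_i) = \begin{cases} \frac{c}{\pi u_i}\sin\left(\frac{\pi u_i}{c}\right) & \text{if } |u_i| \le c \\ 0 & \text{if } |u_i| > c \end{cases}$$

By default, $c = 1.34\pi$.


Tukey’s Biweight (c)

$$\omega(u_i) = \begin{cases} \left(1 - \frac{u_i^2}{c^2}\right)^2 & \text{if } |u_i| \le c \\ 0 & \text{if } |u_i| > c \end{cases}$$

By default, c = 4.685.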
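
The iteratively reweighted form of M-estimation can be sketched for the Huber weight function. This is a simplified illustration with hypothetical function names: it assumes unit caseweights, starts the iteration at the sample median (the starting value is an assumption, not stated above), and uses the median absolute deviation from the median as the spread s.

```python
from statistics import median


def huber_weight(u, k=1.339):
    """omega(u) = 1 if |u| <= k, else (k / u) * sgn(u), which equals k / |u|."""
    return 1.0 if abs(u) <= k else k / abs(u)


def m_estimate(y, weight=huber_weight, eps=0.005, max_iter=30):
    """Iterate T <- sum(omega_i * y_i) / sum(omega_i) until the relative change is small."""
    t = median(y)  # assumption: start from the median
    s = median([abs(v - t) for v in y])  # spread: median absolute deviation
    if s == 0:
        return t  # no spread to scale by; return the median unchanged
    for _ in range(max_iter):
        omega = [weight((v - t) / s) for v in y]
        t_new = sum(o * v for o, v in zip(omega, y)) / sum(omega)
        if abs(t_new - t) <= eps * abs((t_new + t) / 2):
            return t_new
        t = t_new
    return t
```

For the sample 1, 2, 3, 4, 100 the estimate stays close to 3, whereas the arithmetic mean is 22; the outlying value receives a weight of roughly 0.014 instead of 1.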


Tests of Normality 


The following tests are available. 


Shapiro-Wilk Statistic (W) 


Since the W statistic is based on the order statistics of the sample, the caseweights have to be
restricted to integers. Hence, before W is calculated, all the caseweights are rounded to the closest
integer and the series is expanded. Let $c_i^*$ be the closest integer to $c_i$; then

$$cc_i^* = \sum_{k=1}^{i} c_k^*, \qquad W_s = cc_m^* = \sum_{k=1}^{m} c_k^*$$

The original series $y = \{y_1, \ldots, y_m\}$ is expanded to

$$x = \{x_1, \ldots, x_{W_s}\}$$

where

$$x_{cc_{i-1}^*+1} = \ldots = x_{cc_i^*} = y_i, \quad i = 1, \ldots, m$$

Then the W statistic is defined as 


Ww 2 
a, =a =>
: Me T((W.4+1)/2) £W . 
Vr(W.j2+t) Ws > 20 
a, = [q2 a [2 
1= ~V epews = Vf ow 
a; = (2/c)m;,i = 2,...,W, -—1 


m; = v7! (Go) , where W is the c.d.f. of a standard normal distribution 
s— 40 


a = 0.314195 + 0.06333638 — 0.0108953? 


B = logigWs 
Wil 2 
9 F 
c=4 ) —— > 
— Ia* 
i=l (1 207) 


Based on the computed W statistic, the significance is calculated by linearly interpolating within
the range of simulated critical values given in Shapiro and Wilk (1965).


If non-integer weights are specified, the Shapiro-Wilk statistic is calculated when the weighted
sample size lies between 3 and 50. For no weights or integer weights, the statistic is calculated
when the weighted sample size lies between 3 and 5000.


If $W > w_{0.99}$, the critical value of the 99th percentile, the significance is reported as >0.99. Similarly,
if $W < w_{0.01}$, the critical value of the first percentile, the significance is reported as <0.01.


Kolmogorov-Smirnov Statistic with Lilliefors’ Significance 


Lilliefors (Lilliefors, 1967) presented a table for testing normality using the Kolmogorov-Smirnov
statistic when the mean and variance of the population are unknown. This statistic is

$$D_a = \max\{D_+, D_-\}$$

where

$$D_+ = \max_i\{\hat{F}(y_i) - F(y_i)\}$$
$$D_- = \max_i\{F(y_i) - \hat{F}(y_{i-1})\}$$

where $\hat{F}(x)$ is the sample cumulative distribution and $F(x)$ is the cumulative normal distribution
whose mean and variance are estimated from the sample.

Dallal and Wilkinson (Dallal and Wilkinson, 1986) corrected the critical values for testing
normality reported by Lilliefors. With the corrected table they derived an analytic approximation
to the upper tail probabilities of $D_a$ for probabilities less than 0.1. The following formula is used
to estimate the critical value $D_c$ for probability 0.1.

$$D_c = \frac{-b - \sqrt{b^2 - 4ac}}{2a}$$

where, if $W \le 100$,

$$a = -7.01256\,(W + 2.78019)$$

$$b = 2.99587\sqrt{W + 2.78019}$$

$$c = 2.1804661 + \frac{0.974598}{\sqrt{W}} + \frac{1.67997}{W}$$

If W > 100

$$a = -7.90289126054 \cdot W^{0.98}$$

$$b = 3.180370175721 \cdot W^{0.49}$$

$$c = 2.2947256$$

The Lilliefors significance p is calculated as follows:

If $D_a = D_c$, p = 0.1.

If $D_a > D_c$, $p = \exp\{aD_a^2 + bD_a + c - 2.3025851\}$.

If $D_{0.2} \le D_a < D_c$, linear interpolation between $D_{0.2}$ and $D_c$ is done, where $D_{0.2}$ is the critical value
for probability 0.2.

If $D_a < D_{0.2}$, p is reported as > 0.2.
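
The Dallal-Wilkinson approximation can be sketched as below. The function names are hypothetical, and the sketch covers only the analytic part ($D_a \ge D_c$): the tabulated critical value $D_{0.2}$ needed for the interpolation branch is not given in this section, so the function returns `None` in that region rather than guessing.

```python
import math


def dallal_wilkinson_coefficients(w):
    """Return (a, b, c) for weighted sample size W."""
    if w <= 100:
        a = -7.01256 * (w + 2.78019)
        b = 2.99587 * math.sqrt(w + 2.78019)
        c = 2.1804661 + 0.974598 / math.sqrt(w) + 1.67997 / w
    else:
        a = -7.90289126054 * w ** 0.98
        b = 3.180370175721 * w ** 0.49
        c = 2.2947256
    return a, b, c


def critical_value_10_percent(w):
    """D_c = (-b - sqrt(b^2 - 4ac)) / (2a), the critical value for probability 0.1."""
    a, b, c = dallal_wilkinson_coefficients(w)
    return (-b - math.sqrt(b * b - 4 * a * c)) / (2 * a)


def lilliefors_significance(d_a, w):
    """Return p when D_a >= D_c; None when table interpolation would be required."""
    a, b, c = dallal_wilkinson_coefficients(w)
    if d_a < critical_value_10_percent(w):
        return None
    return math.exp(a * d_a ** 2 + b * d_a + c - 2.3025851)
```

Because 2.3025851 is ln 10, the exponent equals ln 0.1 exactly when $aD_a^2 + bD_a + c = 0$, which is why $D_c$ is the root of that quadratic. The two coefficient sets also agree at W = 100, so the critical value is continuous across the switch.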


Group Statistics 


Assume that there are k ($k \ge 2$) combinations of grouping factors. For every combination i,
i = 1, 2, ..., k, let $\{y_{i1}, \ldots, y_{im_i}\}$ be the sample observations with the corresponding caseweights
$\{c_{i1}, \ldots, c_{im_i}\}$.


Spread versus Level


If a transformation value, a, is given, the spread (s) and level (l) are defined based on the
transformed data. Let x be the transformed value of y; for every i = 1, ..., k, j = 1, ..., $m_i$

$$x_{ij} = \begin{cases} \ln y_{ij} & \text{if } a = 0 \\ y_{ij}^{a} & \text{otherwise} \end{cases}$$

Then the spread ($s_i$) and the level ($l_i$) are respectively defined as the Interquartile Range and the
median of $\{x_{i1}, \ldots, x_{im_i}\}$ with corresponding caseweights $\{c_{i1}, \ldots, c_{im_i}\}$. However, if a is not
specified, the spread and the level are natural logarithms of the Interquartile Range and of the
median of the original data. Finally, the slope is the regression coefficient of s on l, which is
defined as

$$\frac{\sum_{i=1}^{k}(l_i - \bar{l})(s_i - \bar{s})}{\sum_{i=1}^{k}(l_i - \bar{l})^2}$$

In some situations, the transformations cannot be done. The spread-versus-level plot and Levene
statistic will not be produced if:

■ a is a negative integer and at least one of the data is 0
■ a is a negative non-integer and at least one of the data is less than or equal to 0
■ a is a positive non-integer and at least one of the data is less than 0
■ a is not specified and the median or the spread is less than or equal to 0


Levene Test of Homogeneity of Variances 


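The Levene test statistic is based on the transformed data and is defined by

$$L_a = \frac{(W-k)\sum_{i=1}^{k} w_i(\bar{z}_i - \bar{z})^2}{(k-1)\sum_{i=1}^{k}\sum_{l=1}^{m_i} c_{il}(z_{il} - \bar{z}_i)^2}$$

where

$$w_i = \sum_{l=1}^{m_i} c_{il}$$

$$\bar{x}_i = \frac{\sum_{l=1}^{m_i} c_{il}\,x_{il}}{w_i}$$

$$z_{il} = |x_{il} - \bar{x}_i|$$

$$\bar{z}_i = \frac{\sum_{l=1}^{m_i} c_{il}\,z_{il}}{w_i}$$

$$\bar{z} = \frac{\sum_{i=1}^{k} w_i\,\bar{z}_i}{W}$$

The significance of $L_a$ is calculated from the F distribution with degrees of freedom k - 1 and
W - k.


Groups with zero variance are included in the test.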
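
The statistic is a one-way analysis of variance on the absolute deviations from each group mean, $z_{il} = |x_{il} - \bar{x}_i|$, and can be sketched as follows. The function name is hypothetical, and unit caseweights are assumed when none are supplied.

```python
def levene_statistic(groups, weights=None):
    """Return (L_a, (k - 1, W - k)) for a list of groups of transformed values x_il."""
    if weights is None:
        weights = [[1.0] * len(g) for g in groups]  # assumption: unweighted cases
    k = len(groups)
    group_weights, group_means, within = [], [], 0.0
    for x, c in zip(groups, weights):
        w_i = sum(c)
        x_bar = sum(cl * xl for xl, cl in zip(x, c)) / w_i
        z = [abs(xl - x_bar) for xl in x]  # absolute deviations from the group mean
        z_bar = sum(cl * zl for zl, cl in zip(z, c)) / w_i
        within += sum(cl * (zl - z_bar) ** 2 for zl, cl in zip(z, c))
        group_weights.append(w_i)
        group_means.append(z_bar)
    w = sum(group_weights)
    z_all = sum(w_i * zb for w_i, zb in zip(group_weights, group_means)) / w
    between = sum(w_i * (zb - z_all) ** 2 for w_i, zb in zip(group_weights, group_means))
    return (w - k) * between / ((k - 1) * within), (k - 1, w - k)
```

Two groups with the same spread, such as `[1, 2, 3]` and `[11, 12, 13]`, give a statistic of 0 regardless of how far apart their means are.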


Robust Levene’s Test of Homogeneity of Variances 


With the current version of Levene’s $L_a$, the following can be considered as options in order to
obtain robust Levene’s tests:


■ Levene’s test $L_b$ based on $z_{il}^{(b)} = |x_{il} - \tilde{x}_i|$, where $\tilde{x}_i$ is the median of the $x_{il}$’s for group i.
Median calculation is done by the method requested. The default method is HAVERAGE.
Once the $\tilde{x}_i$’s and hence the $z_{il}^{(b)}$’s are calculated, apply the formula for $L_a$, shown in the section
above, to obtain $L_b$ by replacing $z_{il}$, $\bar{z}_i$ and $\bar{z}$ with $z_{il}^{(b)}$, $\bar{z}_i^{(b)}$ and $\bar{z}^{(b)}$ respectively.

Two significances of $L_b$ are given. One is calculated from an F-distribution with degrees of
freedom k - 1 and W - k. Another is calculated from an F-distribution with degrees of freedom
k - 1 and v. The value of v is given by:


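$$v = \frac{\left(\sum_{i=1}^{k} u_i\right)^2}{\sum_{i=1}^{k} u_i^2 / v_i}$$

in which

$$u_i = \sum_{l=1}^{m_i} c_{il}\left(z_{il}^{(b)} - \bar{z}_i^{(b)}\right)^2$$

and

$$v_i = w_i - 1$$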

■ Levene’s test $L_c$ based on $z_{il}^{(c)} = |x_{il} - T_{i,0.9}|$, where $T_{i,0.9}$ is the 5% trimmed mean of the $x_{il}$’s
for group i.
Once the $T_{i,0.9}$’s and hence the $z_{il}^{(c)}$’s are calculated, apply the formula of $L_a$ to obtain $L_c$ by
replacing $z_{il}$, $\bar{z}_i$ and $\bar{z}$ with $z_{il}^{(c)}$, $\bar{z}_i^{(c)}$ and $\bar{z}^{(c)}$ respectively.


The significance of $L_c$ is calculated from an F-distribution with degrees of freedom k - 1
and W - k.
Plots


The following plots are available. 


Normal Probability Plot (NPPLOT) 


For every distinct observation $y_i$, $R_i$ is the rank (the mean of ranks is assigned to ties). The
normal score $NS_i$ is calculated by

$$NS_i = \Phi^{-1}\left(\frac{R_i}{W+1}\right)$$

where $\Phi^{-1}$ is the inverse of the standard normal cumulative distribution function. The NPPLOT is
the plot of $(y_1, NS_1), \ldots, (y_m, NS_m)$.

Detrended Normal Plot
The detrended normal plot is the scatterplot of $(y_1, D_1), \ldots, (y_m, D_m)$, where $D_i$ is the difference
between the Z-score and normal score, which is defined by

$$D_i = Z_i - NS_i$$

and

$$Z_i = \frac{y_i - \bar{y}}{s}$$

where $\bar{y}$ is the average and s is the standard deviation.


Boxplot 


The boundaries of the box are Tukey’s hinges. The length of the box is the interquartile range
based on Tukey’s hinges. That is,

$$IQR = Q_3 - Q_1$$

Define

$$STEP = 1.5\,IQR$$

A case is an outlier if

$$Q_3 + STEP \le y_i < Q_3 + 2\,STEP$$

or

$$Q_1 - 2\,STEP < y_i \le Q_1 - STEP$$

A case is an extreme if

$$y_i \ge Q_3 + 2\,STEP$$

or

$$y_i \le Q_1 - 2\,STEP$$
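
The outlier and extreme rules reduce to two comparisons per case. The sketch below uses a hypothetical function name and assumes the boundary conventions in which a case lying exactly one STEP beyond a hinge counts as an outlier and a case lying exactly two STEPs beyond counts as an extreme.

```python
def classify_case(y, q1, q3):
    """Return "extreme", "outlier" or "inside" for a value y given Tukey's hinges."""
    step = 1.5 * (q3 - q1)
    if y >= q3 + 2 * step or y <= q1 - 2 * step:
        return "extreme"
    if y >= q3 + step or y <= q1 - step:
        return "outlier"
    return "inside"
```

With hinges at 10 and 20, STEP is 15, so values from 35 up to (but not including) 50 are outliers and values of 50 or more are extremes.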
EXSMOOTH Algorithms


EXSMOOTH produces one-period-ahead forecasts for different models.


Notation 


The following notation is used throughout this section unless otherwise stated: 


$X_t$    Observed series, t = 1, ..., n
$\hat{X}_t(1)$    Forecast of one period ahead from time t
$p$    Number of periods
$k$    Number of complete cycles ([n/p])
$e_t$    tth residual ($X_t - \hat{X}_{t-1}(1)$)
$S_0$    Initial value for series
$T_0$    Initial value for trend
$I_{1-p}, \ldots, I_0$    Initial values for seasonal factors
$m_l$    Mean for the lth cycle, $\sum_{t=(l-1)p+1}^{lp} X_t / p$

Note the following points: 


■ $I_{1-p}, \ldots, I_0$ are obtained from the SEASON procedure with MA = EQUAL if p is even;
otherwise MA = CENTERED is used for both multiplicative and additive models.

■ The index for the fitted series starts with zero.

■ The value saved in the FIT variable for the tth case is $\hat{X}_{t-1}(1)$.


Models 


The following models are available. 


No Trend, No Seasonality Model 
$X_t = b + \varepsilon_t$


Initial value 


So=X 
$\hat{X}_0(1) = S_0$, $e_1 = X_1 - \hat{X}_0(1)$


$S_t = S_{t-1} + \alpha e_t$


No Trend, Additive Seasonality Model 


$X_t = b + I_t + \varepsilon_t$


Initial value 


Sy == 
then 

$\hat{X}_0(1) = S_0 + I_{1-p}$
$e_1 = X_1 - \hat{X}_0(1)$


$S_t = S_{t-1} + \alpha e_t$
$I_t = I_{t-p} + \delta(1-\alpha)e_t$


$\hat{X}_t(1) = S_t + I_{t-p+1}$


No Trend, Multiplicative Seasonality Model 


$X_t = bI_t + \varepsilon_t$


Initial value 


$\hat{X}_0(1) = S_0 I_{1-p}$
$e_1 = X_1 - \hat{X}_0(1)$


$S_t = S_{t-1} + \alpha e_t / I_{t-p}$
$I_t = I_{t-p} + \delta(1-\alpha)e_t / S_t$


$\hat{X}_t(1) = S_t I_{t-p+1}$


Linear Trend, No Seasonality Model 


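$X_t = b_0 + b_1 t + \varepsilon_t$


Initial values

$$T_0 = \frac{X_n - X_1}{n-1}$$
$$S_0 = X_1 - \tfrac{1}{2}T_0$$

then

$\hat{X}_0(1) = S_0 + T_0$
$e_1 = X_1 - \hat{X}_0(1)$


$S_t = S_{t-1} + T_{t-1} + \alpha e_t$
$T_t = T_{t-1} + \alpha\gamma e_t$


$\hat{X}_t(1) = S_t + T_t$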
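
The error-correction recursions for the linear trend, no seasonality model can be sketched as follows. The function name is hypothetical; the initial values assumed here are $T_0 = (X_n - X_1)/(n-1)$ and $S_0 = X_1 - T_0/2$, and the returned list holds the one-period-ahead forecasts $\hat{X}_{t-1}(1)$ aligned with each observation, as saved in the FIT variable.

```python
def linear_trend_smoothing(x, alpha, gamma):
    """Return (fits, next_forecast) for the linear trend, no seasonality model."""
    n = len(x)
    trend = (x[-1] - x[0]) / (n - 1)  # T_0
    level = x[0] - 0.5 * trend  # S_0
    forecast = level + trend  # X_hat_0(1)
    fits = []
    for obs in x:
        fits.append(forecast)  # forecast made before seeing this observation
        e = obs - forecast  # e_t
        level = level + trend + alpha * e  # S_t uses T_{t-1}
        trend = trend + alpha * gamma * e  # T_t
        forecast = level + trend  # X_hat_t(1)
    return fits, forecast
```

With `alpha = 0` the initial line is simply extrapolated, and with `alpha = 1, gamma = 0` each forecast is the previous observation plus the fixed initial trend. The other models in this section follow the same loop with their own update equations.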


Linear Trend, Additive Seasonality Model
$X_t = b_0 + b_1 t + I_t + \varepsilon_t$
Initial values

$$T_0 = \frac{m_k - m_1}{(k-1)p}$$

$$S_0 = m_1 - \frac{p}{2}T_0$$

then
$\hat{X}_0(1) = S_0 + T_0 + I_{1-p}$
$S_t = S_{t-1} + T_{t-1} + \alpha e_t$


$T_t = T_{t-1} + \alpha\gamma e_t$
$I_t = I_{t-p} + \delta(1-\alpha)e_t$


$\hat{X}_t(1) = S_t + T_t + I_{t-p+1}$


Linear Trend, Multiplicative Seasonality Model 
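$X_t = (b_0 + b_1 t)I_t + \varepsilon_t$


Initial values

$$T_0 = \frac{m_k - m_1}{(k-1)p}$$

$$S_0 = m_1 - \frac{p}{2}T_0$$

then

$\hat{X}_0(1) = (S_0 + T_0)I_{1-p}$
$S_t = S_{t-1} + T_{t-1} + \alpha(e_t / I_{t-p})$
$T_t = T_{t-1} + \alpha\gamma(e_t / I_{t-p})$
$I_t = I_{t-p} + \delta(1-\alpha)(e_t / S_t)$
$\hat{X}_t(1) = (S_t + T_t)I_{t-p+1}$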


Exponential Trend, No Season Model 


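$X_t = b_0 b_1^t + \varepsilon_t$


Initial values

$$T_0 = \exp\{\ln X_2 - \ln X_1\} = \frac{X_2}{X_1}$$

$$S_0 = \exp\{\ln X_1 - \tfrac{1}{2}\ln T_0\} = \frac{X_1}{T_0^{1/2}}$$

then

$\hat{X}_0(1) = S_0 T_0$
$S_t = S_{t-1}T_{t-1} + \alpha e_t$
$T_t = T_{t-1} + \alpha\gamma e_t / S_{t-1}$


$\hat{X}_t(1) = S_t T_t$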


Exponential Trend, Additive Seasonal Model 
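$X_t = b_0 b_1^t + I_t + \varepsilon_t$
Initial values

$$T_0 = \exp\{(\ln m_2 - \ln m_1)/p\}$$

$$S_0 = \exp\{\ln m_1 - \tfrac{p}{2}\ln T_0\}$$

then

$\hat{X}_0(1) = S_0 T_0 + I_{1-p}$
$S_t = S_{t-1}T_{t-1} + \alpha e_t$


$T_t = T_{t-1} + \alpha\gamma e_t / S_{t-1}$
$I_t = I_{t-p} + \delta(1-\alpha)e_t$


$\hat{X}_t(1) = S_t T_t + I_{t-p+1}$
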
Exponential Trend, Multiplicative Seasonality Model
$X_t = (b_0 b_1^t)I_t + \varepsilon_t$


Initial values 
Ty = exp {(In mz — Inm ,)/(k — 1)} 
So = exp {In m,— ln Ty } 


then 
$\hat{X}_0(1) = (S_0 T_0)I_{1-p}$

$S_t = S_{t-1}T_{t-1} + \alpha e_t / I_{t-p}$

$T_t = T_{t-1} + \alpha\gamma e_t / (I_{t-p}S_{t-1})$
$I_t = I_{t-p} + \delta(1-\alpha)e_t / S_t$


$\hat{X}_t(1) = (S_t T_t)I_{t-p+1}$


Damped Trend, No Seasonality Model 
$X_t = b_0 + \phi b_1 t + \varepsilon_t$


Initial values

$$T_0 = \frac{1}{\phi}\cdot\frac{X_n - X_1}{n-1}$$

$$S_0 = X_1 - \tfrac{1}{2}T_0\phi$$

then


$\hat{X}_0(1) = S_0 + \phi T_0$
$S_t = S_{t-1} + \phi T_{t-1} + \alpha e_t$


$T_t = \phi T_{t-1} + \alpha\gamma e_t$


$\hat{X}_t(1) = S_t + \phi T_t$


Damped Trend, Additive Seasonality Model 
$X_t = b_0 + \phi b_1 t + I_t + \varepsilon_t$


Initial values

$$T_0 = \frac{1}{\phi}\cdot\frac{m_k - m_1}{(k-1)p}$$

$$S_0 = m_1 - \frac{p}{2}T_0\phi$$

then
$\hat{X}_0(1) = S_0 + \phi T_0 + I_{1-p}$
$S_t = S_{t-1} + \phi T_{t-1} + \alpha(2-\alpha)e_t$
$T_t = \phi T_{t-1} + \alpha(\alpha - \phi + 1)e_t$
$I_t = I_{t-p} + \delta[1 - \alpha(2-\alpha)]e_t$


$\hat{X}_t(1) = S_t + \phi T_t + I_{t-p+1}$


Damped Trend, Multiplicative Seasonality Model 
$X_t = (b_0 + \phi b_1 t)I_t + \varepsilon_t$


Initial values

$$T_0 = \frac{1}{\phi}\cdot\frac{m_k - m_1}{(k-1)p}$$

$$S_0 = m_1 - \frac{p}{2}T_0\phi$$

then


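$\hat{X}_0(1) = (S_0 + \phi T_0)I_{1-p}$
$S_t = S_{t-1} + \phi T_{t-1} + \alpha(2-\alpha)e_t / I_{t-p}$


$T_t = \phi T_{t-1} + \alpha(\alpha - \phi + 1)e_t / I_{t-p}$
$I_t = I_{t-p} + \delta[1 - \alpha(2-\alpha)]e_t / S_t$


$\hat{X}_t(1) = (S_t + \phi T_t)I_{t-p+1}$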
FACTOR Algorithms


FACTOR performs factor analysis based either on correlations or covariances and using one of the 
seven extraction methods. 


Extraction of Initial Factors 


The following extraction methods are available. 


Principal Components Extraction (PC) 


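The matrix of factor loadings based on factor m is

$$\Lambda_m = \Omega_m\Gamma_m^{1/2}$$

where

$$\Omega_m = (\omega_1, \omega_2, \ldots, \omega_m)$$
$$\Gamma_m = \mathrm{diag}(|\gamma_1|, |\gamma_2|, \ldots, |\gamma_m|)$$

The communality of variable i is given by

$$h_i = \sum_{j=1}^{m} |\gamma_j|\,\omega_{ij}^2$$


Analyzing a Correlation Matrix


$\gamma_1 \ge \gamma_2 \ge \ldots \ge \gamma_m$ are the eigenvalues and $\omega_i$ are the corresponding eigenvectors of R, where
R is the correlation matrix.


Analyzing a Covariance Matrix


$\gamma_1 \ge \gamma_2 \ge \ldots \ge \gamma_m$ are the eigenvalues and $\omega_i$ are the corresponding eigenvectors of $\Sigma$, where
$\Sigma = (\sigma_{ij})_{n \times n}$ is the covariance matrix.

The rescaled loadings matrix is $\Lambda_{mR} = [\mathrm{diag}\,\Sigma]^{-1/2}\Lambda_m$.

The rescaled communality of variable i is $h_{iR} = \sigma_{ii}^{-1}h_i$.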
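
Given the leading eigenvalues and eigenvectors, the loadings $\Lambda_m = \Omega_m\Gamma_m^{1/2}$ and the communalities (row sums of squared loadings) are a few lines of code. The sketch below uses a hypothetical function name and takes the eigenpairs as input, so that it needs no linear algebra library; how they are obtained is outside its scope.

```python
import math


def pc_loadings(eigenvalues, eigenvectors, m):
    """Return (loadings, communalities) for the first m components.

    eigenvalues[j]  -- gamma_j, sorted in descending order
    eigenvectors[j] -- omega_j, the unit-length eigenvector paired with gamma_j
    """
    p = len(eigenvectors[0])
    loadings = [
        [eigenvectors[j][i] * math.sqrt(abs(eigenvalues[j])) for j in range(m)]
        for i in range(p)
    ]
    communalities = [sum(value * value for value in row) for row in loadings]
    return loadings, communalities
```

For the 2 x 2 correlation matrix with off-diagonal 0.6, the eigenvalues are 1.6 and 0.4 with eigenvectors $(1, 1)/\sqrt{2}$ and $(1, -1)/\sqrt{2}$. Extracting one component gives a communality of 0.8 for each variable, and extracting both recovers the full unit variance.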


Principal Axis Factoring 
Analyzing a Correlation Matrix 


An iterative solution for communalities and factor loadings is sought. At iteration i, the
communalities from the preceding iteration are placed on the diagonal of R, and the resulting R
is denoted by $R_i$. The eigenanalysis is performed on $R_i$ and the new communality of variable j
is estimated by

$$h_{j(i)} = \sum_{k=1}^{m} |\gamma_{k(i)}|\,\omega_{jk(i)}^2$$

The factor loadings are obtained by

$$\Lambda_{m(i)} = \Omega_{m(i)}\Gamma_{m(i)}^{1/2}$$

Iterations continue until the maximum number (default 25) is reached or until the maximum
change in the communality estimates is less than the convergence criterion (default 0.001).


Analyzing a Covariance Matrix


This analysis is the same as analyzing a correlation matrix, except $\Sigma$ is used instead of the
correlation matrix R. Convergence is dependent on the maximum change of rescaled communality
estimates.


At iteration i, the rescaled loadings matrix is $\Lambda_{m(i)R} = [\mathrm{diag}\,\Sigma]^{-1/2}\Lambda_{m(i)}$. The rescaled
communality of variable j is $h_{j(i)R} = \sigma_{jj}^{-1}h_{j(i)}$.


Maximum Likelihood (ML) 


The maximum likelihood solutions of $\Lambda$ and $\psi^2$ are obtained by minimizing

$$F = \mathrm{tr}\left[(\Lambda\Lambda' + \psi^2)^{-1}R\right] - \log\left|(\Lambda\Lambda' + \psi^2)^{-1}R\right| - p$$

with respect to $\Lambda$ and $\psi$, where p is the number of variables, $\Lambda$ is the factor loading matrix, and
$\psi^2$ is the diagonal matrix of unique variances.


The minimization of F is performed by way of a two-step algorithm. First, the conditional
minimum of F for a given $\psi$ is found. This gives the function $f(\psi)$, which is minimized
numerically using the Newton-Raphson procedure. Let $x^{(s)}$ be the column vector containing the
logarithm of the diagonal elements of $\psi^2$ at the sth iteration; then

$$x^{(s+1)} = x^{(s)} - d^{(s)}$$

where $d^{(s)}$ is the solution to the system of linear equations

$$H^{(s)}d^{(s)} = h^{(s)}$$

and where

$$H^{(s)} = \left(\partial^2 f(\psi)/\partial x_i\,\partial x_j\right)$$

and $h^{(s)}$ is the column vector containing $\partial f(\psi)/\partial x_i$. The starting point $x^{(1)}$ is

$$x_i^{(1)} = \begin{cases} \log\left[(1 - m/2p)/r^{ii}\right] & \text{for ML and GLS} \\ \left[(1 - m/2p)/r^{ii}\right]^{1/2} & \text{for ULS} \end{cases}$$

where m is the number of factors and $r^{ii}$ is the ith diagonal element of $R^{-1}$.

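The values of $f(\psi)$, $\partial f/\partial x_i$, and $\partial^2 f/\partial x_i\,\partial x_j$ can be expressed in terms of the eigenvalues

$$\gamma_1 \le \gamma_2 \le \ldots \le \gamma_p$$

and corresponding eigenvectors

$$\omega_1, \omega_2, \ldots, \omega_p$$

of matrix $\psi R^{-1}\psi$. That is,

$$f(\psi) = \sum_{k=m+1}^{p}\left(\log\gamma_k + \gamma_k^{-1} - 1\right)$$

$$\frac{\partial f}{\partial x_i} = \sum_{k=m+1}^{p}\left(1 - \gamma_k^{-1}\right)\omega_{ik}^2$$

$$\frac{\partial^2 f}{\partial x_i\,\partial x_j} = -\delta_{ij}\frac{\partial f}{\partial x_i} + \sum_{k=m+1}^{p}\omega_{ik}\omega_{jk}\left(\sum_{n=1}^{m}\frac{\gamma_k + \gamma_n - 2}{\gamma_k - \gamma_n}\,\omega_{in}\omega_{jn} + \delta_{ij}\right)$$

The approximate second-order derivatives

$$\frac{\partial^2 f}{\partial x_i\,\partial x_j} \cong \left(\sum_{k=m+1}^{p}\omega_{ik}\omega_{jk}\right)^2$$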

are used in the initial step and when the matrix of the exact second-order derivatives is not positive
definite or when all elements of the vector d are greater than 0.1. If $\partial^2 f/\partial x_i^2 < 0.05$ (Heywood
variables), the diagonal element is replaced by 1 and the rest of the elements of that column and
row are set to 0. If the value of $f(\psi)$ is not decreased by step $d^{(s)}$, the step is halved and halved
again until the value of $f(\psi)$ decreases or 25 halvings fail to produce a decrease. (In this case, the
computations are terminated.) Stepping continues until the largest absolute value of the elements
of d is less than the criterion value (default 0.001) or until the maximum number of iterations
(default 25) is reached. Using the converged value of x (denoted by $\hat{x}$), the eigenanalysis is
performed on the matrix $\hat{\psi}R^{-1}\hat{\psi}$. The factor loadings are computed as

$$\hat{\Lambda}_m = \hat{\psi}\,\Omega_m\left(\Gamma_m^{-1} - I_m\right)^{1/2}$$

where

$$\Gamma_m = \mathrm{diag}(\gamma_1, \gamma_2, \ldots, \gamma_m)$$
$$\Omega_m = (\omega_1, \omega_2, \ldots, \omega_m)$$


Unweighted and Generalized Least Squares (ULS, GLS) 


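The same basic algorithm is used in ULS and GLS methods as in maximum likelihood, except that

$$f(\psi) = \begin{cases} \sum_{k=m+1}^{p}\frac{\gamma_k^2}{2} & \text{for ULS} \\ \sum_{k=m+1}^{p}\frac{(\gamma_k - 1)^2}{2} & \text{for GLS} \end{cases}$$

For the ULS method, the eigenanalysis is performed on the matrix $R - \psi^2$, where
$\gamma_1 \ge \ldots \ge \gamma_p$ are the eigenvalues. In terms of the derivatives, for ULS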


Pp 
ue 2 Vk\ 2 
Subs = a. vain = wpe + 05; os (x3 — 2) ej 


i tk 
owe n=1 k=m+1 
and 
2 
D_, Pp 
PE atria 
Vie; Wik 

Or,Or, ply ae 
For GLS 

f Pp 
0 2 4 2 
se = >, (%-*He)wh, 


Dp m Yp dz 2 

Of c Of = Ak [My en ce 

Or, a = bigoe a s Ykwikejh y n— Win jn +7 exp |( rat vj /2| 
k=m-+1 n=1 Ik m 


and 


cal a 3 
Ox; Ox; 7 eat i 


k=m-+1 


2 


Also, the factor loadings of the ULS method are obtained by 
Am = Oy ee 


n 


The chi-square statistic for m factors for the ML and GLS methods is given by

$$\chi_m^2 = \left(W - 1 - \frac{2p+5}{6} - \frac{2m}{3}\right) f(\hat\psi)$$

with $\left((p - m)^2 - p - m\right)/2$ degrees of freedom.


Alpha (Harman, 1976) 
Iteration for Communalities 
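At each iteration i:
> The eigenvalues ($\Gamma_{(i)}$) and eigenvectors ($\Omega_{(i)}$) of $H_{(i-1)}^{-1/2}(R - I)H_{(i-1)}^{-1/2} + I$ are computed.

> The new communalities are

$$h_{k(i)} = \left(\sum_{j=1}^{m}\left|\gamma_{j(i)}\right|\omega_{kj(i)}^2\right)h_{k(i-1)}$$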


The initial values of the communalities, $H_0$, are

$$h_{i0} = \begin{cases} 1 - 1/r^{ii} & |R| \ge 10^{-8} \text{ and all } 0 \le h_{i0} \le 1 \\ \max_j |r_{ij}| & \text{otherwise} \end{cases}$$

where $r^{ii}$ is the ith diagonal entry of $R^{-1}$.
If $|R| \ge 10^{-8}$ and all $r^{ii}$ are equal to one, the procedure is terminated. If for some i, $\max_j |r_{ij}| > 1$,
the procedure is terminated.
> Iteration stops if any of the following are true:
$\max_k \left|h_{k(i)} - h_{k(i-1)}\right| < \text{EPS}$
$i = \text{MAX}$
$h_{k(i)} = 0$ for any k
Final Communalities and Factor Pattern Matrix 


The communalities are the values when iteration stops, unless the last termination criterion is true, 
in which case the procedure terminates. The factor pattern matrix is 


$$F_m = H_f^{1/2}\,\Omega_{m(f)}\,\Gamma_{m(f)}^{1/2}$$


where f is the final iteration. 


Image (Kaiser, 1963) 


Factor Analysis of a Correlation Matrix 


> Eigenvalues and eigenvectors of $S^{-1}RS^{-1}$ are found.
$S^2 = \mathrm{diag}\left(1/r^{11}, \dots, 1/r^{nn}\right)$
$r^{ii}$ = ith diagonal element of $R^{-1}$

> The factor pattern matrix is

$$F_m = S\,\Omega_m(\Lambda_m - I_m)\Lambda_m^{-1/2}$$

where $\Omega_m$ and $\Lambda_m$ correspond to the m eigenvalues greater than 1.
If m = 0, the procedure is terminated.

> The communalities are

$$h_i = \sum_{j=1}^{m}(\gamma_j - 1)^2\,\omega_{ij}^2 / \left(\gamma_j r^{ii}\right)$$

> The image covariance matrix is

$$R + S^2R^{-1}S^2 - 2S^2$$

> The anti-image covariance matrix is

$$S^2R^{-1}S^2$$
Factor Analysis of a Covariance Matrix 


We are using the covariance matrix $\Sigma$ instead of the correlation matrix R. The calculation is
similar to the correlation matrix case.
The rescaled factor pattern matrix is $F_{mR} = [\mathrm{diag}\,\Sigma]^{-1/2}F_m$. The rescaled communality of
variable i is $h_{iR} = \sigma_{ii}^{-1}h_i$.


Factor Rotations 


The following rotation methods are available. 


Orthogonal Rotations (Harman, 1976) 


Rotations are done cyclically on pairs of factors until the maximum number of iterations is 
reached or the convergence criterion is met. The algorithm is the same for all orthogonal rotations, 
differing only in computations of the tangent values of the rotation angles. 


> The factor pattern matrix is normalized by the square root of communalities:

$$\Lambda_m^* = H^{-1/2}\Lambda_m$$

where

$\Lambda_m = (\lambda_1, \dots, \lambda_m)$ is the factor pattern matrix

$H = \mathrm{diag}(h_1, \dots, h_n)$ is the diagonal matrix of communalities

> The transformation matrix T is initialized to $I_m$
> At each iteration i
(1) The convergence criterion is

$$SV_{(i)} = \sum_{j=1}^{m}\left[n\sum_{p=1}^{n}\lambda_{pj(i)}^{*4} - \left(\sum_{p=1}^{n}\lambda_{pj(i)}^{*2}\right)^2\right] / n^2$$

where the initial value of $\Lambda_{m(1)}^*$ is the original factor pattern matrix. For subsequent iterations, the
initial value is the final value of $\Lambda_{m(i-1)}^*$ when all factor pairs have been rotated.


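(2) For all pairs of factors $(\lambda_j, \lambda_k)$ where k > j, the following are computed:

(a) Angle of rotation

$$P = \tfrac{1}{4}\tan^{-1}(X/Y)$$

where

$$X = \begin{cases} D - 2AB/n & \text{Varimax} \\ D - mAB/n & \text{Equimax} \\ D & \text{Quartimax} \end{cases}$$

$$Y = \begin{cases} C - (A^2 - B^2)/n & \text{Varimax} \\ C - m(A^2 - B^2)/2n & \text{Equimax} \\ C & \text{Quartimax} \end{cases}$$

$$u_{p(i)} = \lambda_{pj(i)}^{*2} - \lambda_{pk(i)}^{*2}, \quad v_{p(i)} = 2\lambda_{pj(i)}^{*}\lambda_{pk(i)}^{*}, \quad p = 1, \dots, n$$

$$A = \sum_{p=1}^{n}u_{p(i)}, \qquad B = \sum_{p=1}^{n}v_{p(i)}$$

$$C = \sum_{p=1}^{n}\left[u_{p(i)}^2 - v_{p(i)}^2\right], \qquad D = \sum_{p=1}^{n}2u_{p(i)}v_{p(i)}$$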
If |sin (P)| < 10~*", no rotation is done on the pair of factors. 


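(b) New rotated factors

$$\left(\tilde\lambda_{j(i)}, \tilde\lambda_{k(i)}\right) = \left(\lambda_{j(i)}^*, \lambda_{k(i)}^*\right)\begin{pmatrix}\cos(P) & -\sin(P) \\ \sin(P) & \cos(P)\end{pmatrix}$$

where $\lambda_{j(i)}^*$ are the last values for factor j calculated in this iteration.

(c) Accrued rotation transformation matrix

$$\left(\tilde t_j, \tilde t_k\right) = (t_j, t_k)\begin{pmatrix}\cos(P) & -\sin(P) \\ \sin(P) & \cos(P)\end{pmatrix}$$

where $t_j$ and $t_k$ are the last calculated values of the jth and kth columns of T.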


(d) Iteration is terminated when 


SV) — SVG_1| < 107 
or the maximum number of iterations is reached. 


(e) Final rotated factor pattern matrix 


$$\tilde\Lambda_m = H^{1/2}\Lambda_{m(f)}^*$$

where $\Lambda_{m(f)}^*$ is the value of the last iteration.

(f) Reflect factors with negative sums

If $\sum_{i=1}^{n}\tilde\lambda_{ij(f)} < 0$ then $\tilde\lambda_j = -\tilde\lambda_{j(f)}$

(g) Rearrange the rotated factors such that

$$\sum_{j=1}^{n}\tilde\lambda_{j1}^2 \ge \dots \ge \sum_{j=1}^{n}\tilde\lambda_{jm}^2$$


(h) The communalities are $h_j = \sum_{i=1}^{m}\tilde\lambda_{ji}^2$
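
The pairwise step of the orthogonal rotation can be sketched in a few lines of Python. This is only an illustration of the quantities u, v, A, B, C, D, X, Y and the angle P described above, not the procedure's actual implementation. The function names are hypothetical, and the use of `atan2` to pick the quadrant is an assumption, since the text only states $\tan^{-1}(X/Y)$.

```python
import math

def pair_rotation_angle(col_j, col_k, method="varimax", m=2):
    """Angle P = (1/4) * atan(X / Y) for one pair of normalized factor columns."""
    n = len(col_j)
    u = [a * a - b * b for a, b in zip(col_j, col_k)]
    v = [2.0 * a * b for a, b in zip(col_j, col_k)]
    A, B = sum(u), sum(v)
    C = sum(ui * ui - vi * vi for ui, vi in zip(u, v))
    D = sum(2.0 * ui * vi for ui, vi in zip(u, v))
    if method == "varimax":
        X, Y = D - 2.0 * A * B / n, C - (A * A - B * B) / n
    elif method == "equimax":
        X, Y = D - m * A * B / n, C - m * (A * A - B * B) / (2.0 * n)
    elif method == "quartimax":
        X, Y = D, C
    else:
        raise ValueError("unknown rotation method: " + method)
    # Assumption: atan2 resolves the quadrant; the text only gives tan^-1(X/Y).
    return 0.25 * math.atan2(X, Y)

def rotate_pair(col_j, col_k, angle):
    """Right-multiply the column pair by [[cos, -sin], [sin, cos]]."""
    c, s = math.cos(angle), math.sin(angle)
    new_j = [a * c + b * s for a, b in zip(col_j, col_k)]
    new_k = [-a * s + b * c for a, b in zip(col_j, col_k)]
    return new_j, new_k
```

A full rotation would loop over all pairs k > j, skip pairs whose angle is negligible, and repeat until the change in SV falls below the tolerance or the iteration limit is reached.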


Oblique Rotations 


The direct oblimin method (Jennrich and Sampson, 1966) is used for oblique rotation. The user
can choose the parameter $\delta$. The default value is $\delta = 0$.


(a) The factor pattern matrix is normalized by the square root of the communalities

$$\Lambda_m^* = H^{-1/2}\Lambda_m$$

where

$$h_j = \sum_{k=1}^{m}\lambda_{jk}^2$$

If no Kaiser is specified, this normalization is not done.

(b) Initializations
The factor correlation matrix C is initialized to $I_m$. The following are also computed:

$$s_k = \begin{cases} 1 & \text{if Kaiser} \\ h_k & \text{if no Kaiser} \end{cases}, \quad k = 1, \dots, n$$

$$u_i = \sum_{j=1}^{n}\lambda_{ji}^{*2}, \quad i = 1, \dots, m$$

$$v_i = \sum_{j=1}^{n}\lambda_{ji}^{*4}$$

$$x_i = v_i - (\delta/n)u_i^2$$

$$D = \sum_{i=1}^{m}u_i$$

$$G = \sum_{i=1}^{m}x_i$$

$$H = \sum_{k=1}^{n}s_k^2 - (\delta/n)D^2$$
(c) At each iteration, all possible factor pairs are rotated. For a pair of factors $\lambda_p^*$ and $\lambda_q^*$ $(p \ne q)$,
the following are computed:

$$D_{pq} = D - u_p - u_q$$
$$G_{pq} = G - x_p - x_q$$


> ee - *2 
Spq,i = Si ble Nig 


n 
— of 
Ypq = ) A; ee ae 
i=1 
n 
_ 42) «2 
~pq —_ , Ain Mig 
*2 © 
tS ) Spqirip — (d/n)upDyg 
= eC ee: C 
Z= ) SpqiripXiq — (4/2) ¥pqP pg 


as 2s ABA — (5/n)UpYpq 

R = 2pq — (5/n)uptig 

f= 5 ( Cpq — P/ Xp) 

Q = 4(ap —4eygP + R+2T)/ty 
R = F(epg(T + R)— P= Z)/ty 


> A root, a, of the equation 
B+ P'l? + Q'b+ KR =0 is computed, as well as: 


A=14 2cpga + a? 
4 1/2 


> The rotated pair of factors is 


(35,3, ) =e) 


These replace the previous factor values. 


—a 
1 


> New values are computed for 
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Up = |Aluy 
ip = A*zy 
eres \ «4 
Ug — Aaa 
i=1 
n 
= 1 * #2 
Ug = ) Aig 
i=1 


Sh = Spak 42 

D = Dygq + tip + tg 

G = Gpg + fp + Fy 

All values designated with a tilde replace the corresponding previous values and are used in subsequent calculations.
> The new factor correlations with factor p are 

Cip = fii F toCig (2 - p) 

Cpi = Cip 

Cpp = 1 
> After all factor pairs have been rotated, iteration is terminated if 


MAX iterations have been done or 
-_ <(FOVEPS) 


Fly Flg-s| 
where 
Flip — vf = . 
n 
A= 38 —(5/n)D 
k=1_ 
Fl) =FO 


Otherwise, the factor pairs are rotated again. 


> The final rotated factor pattern matrix is

$$\tilde\Lambda_m = H^{1/2}\tilde\Lambda_m^*$$

where $\tilde\Lambda_m^*$ is the value in the final iteration.

> The factor structure matrix is

$$S = \tilde\Lambda_m C_m$$

where $C_m$ is the factor correlation matrix in the final iteration.


Promax Rotation 


Hendrickson and White (1964) proposed a computationally fast rotation. The speed is achieved
by first rotating to an orthogonal varimax solution and then relaxing the orthogonality of the
factors to better fit simple structure.


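> Varimax rotation is used to get an orthogonal rotated matrix $\Lambda_R = \{\lambda_{ij}\}$.
> The matrix $P = (p_{ij})_{p \times m}$ is calculated, where

$$p_{ij} = \left|\frac{\lambda_{ij}}{\left(\sum_{j=1}^{m}\lambda_{ij}^2\right)^{1/2}}\right|^{k+1}\left(\frac{\left(\sum_{j=1}^{m}\lambda_{ij}^2\right)^{1/2}}{\lambda_{ij}}\right)$$

Here, k (k > 1) is the power of promax rotation.

> The matrix L is calculated.

$$L = \left(\Lambda_R'\Lambda_R\right)^{-1}\Lambda_R'P$$

> The matrix L is normalized by column to a transformation matrix

$$Q = LD$$

where $D = \left(\mathrm{diag}(L'L)\right)^{-1/2}$ is the diagonal matrix that normalizes the columns of L.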


At this stage, the rotated factors are

$$f_{\text{promax\_temp}} = Q^{-1}f_{\text{varimax}}$$

Because

$$\mathrm{var}\left(f_{\text{promax\_temp}}\right) = (Q'Q)^{-1},$$

and the diagonal elements do not equal 1, we must modify the rotated factor to

$$f_{\text{promax}} = C f_{\text{promax\_temp}}$$

where

$$C = \left\{\mathrm{diag}\left((Q'Q)^{-1}\right)\right\}^{-1/2}$$

The rotated factor pattern is

$$\Lambda_{\text{promax}} = \Lambda_{\text{varimax}}\,Q\,C^{-1}$$

The correlation matrix of the factors is

$$R_{ff} = C(Q'Q)^{-1}C'$$

The factor structure matrix is

$$\Lambda_s = \Lambda_{\text{promax}}R_{ff}$$
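
As an illustration of the first promax step, the target matrix P can be computed row by row from the varimax loadings. The sketch below is an assumption-laden reading of the formula for $p_{ij}$: the function name is hypothetical, the default power k = 4 is chosen only for the example, and mapping a zero loading to a zero target entry is a convention the text does not state.

```python
import math

def promax_target(loadings, k=4):
    """Target matrix P: row-normalized loadings raised to power k+1, sign kept."""
    target = []
    for row in loadings:
        norm = math.sqrt(sum(x * x for x in row))
        new_row = []
        for x in row:
            if x == 0.0 or norm == 0.0:
                # Assumption: a zero loading maps to a zero target entry.
                new_row.append(0.0)
            else:
                new_row.append(abs(x / norm) ** (k + 1) * norm / x)
        target.append(new_row)
    return target
```

Raising the normalized loadings to a higher power pushes small loadings toward zero much faster than large ones, which is what lets the subsequent least-squares fit of L relax orthogonality in favor of simple structure.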


Factor Score Coefficients (Harman, 1976) 


Creates one new variable for each factor in the final solution. The following alternative methods 
for calculating the factor scores are available. 


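Regression

$$W = \begin{cases} \Lambda_m\Gamma_m^{-1} & \text{PC without rotation} \\ \Lambda_m\left(\Lambda_m'\Lambda_m\right)^{-1} & \text{PC with rotation} \\ R^{-1}S_m & \text{otherwise} \end{cases}$$

where
$S_m$ = factor structure matrix
$S_m = \Lambda_m$ for orthogonal rotations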
For PC without rotation if any |7;| < 10~*, factor score coefficients are not computed. For PC 


with rotation, if the determinant of A’ ,,,\,,, is less than 10~*, the coefficients are not computed. 
Otherwise, if R is singular, factor score coefficients are not computed. 


Bartlett 
wes AU 
Anderson Rubin

$$W = \left(\Lambda_m'U^{-2}RU^{-2}\Lambda_m\right)^{-1/2}\Lambda_m'U^{-2}$$

where the symmetric square root of the parenthetical term is taken.


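Optional Statistics (Dziuban and Shirkey, 1974)
> The anti-image covariance matrix $A = (a_{ij})$ is given by

$$a_{ij} = \frac{r^{ij}}{r^{ii}r^{jj}}$$

> The chi-square value for Bartlett’s test of sphericity is

$$\chi^2 = -\left(W - 1 - \frac{2p+5}{6}\right)\ln|R|$$

with p(p − 1)/2 degrees of freedom.

> The Kaiser-Meyer-Olkin measure of sampling adequacy is

$$KMO_j = \frac{\sum_{i \ne j}r_{ij}^2}{\sum_{i \ne j}r_{ij}^2 + \sum_{i \ne j}a_{ij}^{*2}}, \qquad KMO = \frac{\sum_{i \ne j}\sum r_{ij}^2}{\sum_{i \ne j}\sum r_{ij}^2 + \sum_{i \ne j}\sum a_{ij}^{*2}}$$

where $a_{ij}^*$ is the anti-image correlation coefficient.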
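
To make the KMO measure concrete, the following sketch computes the per-variable and overall values from two square matrices supplied as nested lists: the correlation matrix and the anti-image correlation matrix. The function name and input layout are illustrative assumptions, and the sketch does not guard against the degenerate case where every off-diagonal entry is zero.

```python
def kmo(corr, anti_image_corr):
    """Per-variable and overall KMO from off-diagonal squared entries."""
    p = len(corr)
    per_variable = []
    r2_total = 0.0
    a2_total = 0.0
    for j in range(p):
        r2 = sum(corr[i][j] ** 2 for i in range(p) if i != j)
        a2 = sum(anti_image_corr[i][j] ** 2 for i in range(p) if i != j)
        per_variable.append(r2 / (r2 + a2))
        r2_total += r2
        a2_total += a2
    return per_variable, r2_total / (r2_total + a2_total)
```

Values near 1 indicate that the anti-image (partial) correlations are small relative to the ordinary correlations.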
FIT Algorithms


FIT displays a variety of descriptive statistics computed from the residual series as an aid in 
evaluating the goodness of fit of one or more models. 


Notation 
The following notation is used throughout this chapter unless otherwise stated: 
DFH                    Hypothesis degrees of freedom
DFE                    Error degrees of freedom
$e_1, \dots, e_n$      Residual (error) series
$X_1, \dots, X_n$      Observed series
n                      Number of cases


Statistics Computed in FIT 


Mean Error (ME)

$$ME = \sum_{i=1}^{n}e_i / n$$

Mean Percent Error (MPE)

$$MPE = \frac{100}{n}\sum_{i=1}^{n}e_i / X_i$$

Mean Absolute Error (MAE)

$$MAE = \sum_{i=1}^{n}|e_i| / n$$

Mean Absolute Percent Error (MAPE)

$$MAPE = \frac{100}{n}\sum_{i=1}^{n}\left|e_i / X_i\right|$$

Sum of Square Error (SSE)

$$SSE = \sum_{i=1}^{n}e_i^2$$

Mean Square Error (MSE)

$$MSE = \begin{cases} SSE/DFE & \text{if DFE is specified or DFH is specified} \\ SSE/n & \text{if neither DFE nor DFH is specified} \end{cases}$$

If only DFH is specified, then DFE = n − DFH.

Root Mean Square Error (RMS)

$$RMS = \sqrt{MSE}$$
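
The statistics up to this point translate directly into code. The sketch below is illustrative only: the function name, argument names, and the dictionary return value are assumptions, and it presumes that no observed value is zero (the percent errors divide by $X_i$).

```python
import math

def fit_statistics(errors, observed, dfe=None, dfh=None):
    """ME, MPE, MAE, MAPE, SSE, MSE and RMS for a residual series."""
    n = len(errors)
    sse = sum(e * e for e in errors)
    # If only DFH is given, derive DFE = n - DFH as described in the text.
    if dfe is None and dfh is not None:
        dfe = n - dfh
    mse = sse / dfe if dfe is not None else sse / n
    return {
        "ME": sum(errors) / n,
        "MPE": 100.0 / n * sum(e / x for e, x in zip(errors, observed)),
        "MAE": sum(abs(e) for e in errors) / n,
        "MAPE": 100.0 / n * sum(abs(e / x) for e, x in zip(errors, observed)),
        "SSE": sse,
        "MSE": mse,
        "RMS": math.sqrt(mse),
    }
```

For example, `fit_statistics([1, -1, 2, -2], [10, 10, 20, 20], dfh=2)` uses DFE = 4 − 2 = 2 as the divisor for MSE.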


Durbin-Watson Statistics (DW) 


FREQUENCIES Algorithms 


If the absolute value of any observation is greater than $10^{13}$, no calculations are done. For sorting
of the observations, see Sorting and Searching. For information on percentiles for grouped data, 
see Grouped Percentiles. 


Notation 

The following notation is used throughout this chapter unless otherwise stated: 
Table 44-1
Notation

Notation    Description
$X_k$       Value of the variable for case k
$w_k$       Weight for case k
NV          Number of distinct values the variable assumes
N           Number of cases
W           Sum of weights of the cases


Basic Statistics 


The values are sorted into ascending order and the following statistics are calculated. 


Sum of Weights of Cases Having Each Value of X

$$f_j = \sum_{k=1}^{N}w_k g_k, \qquad g_k = \begin{cases} 1 & \text{if } X_k = X_j \\ 0 & \text{otherwise} \end{cases}$$

where $X_j$ is the jth largest distinct value of X.

Relative Frequency (Percentage) for each Value of X

$$Rf_j = \left(\frac{f_j}{W}\right) \times 100$$

where

$$W = \sum_{i=1}^{NV}f_i$$

(sum over all categories including those declared as missing values)

Adjusted Frequency (Percentage)

$$Af_j = \left(\frac{f_j}{W_a}\right) \times 100$$

where

$$W_a = \sum_{i=1}^{NV}f_i k_i$$

(sum over nonmissing categories)

and

$$k_i = \begin{cases} 0 & \text{if } X_i \text{ has been declared missing} \\ 1 & \text{otherwise} \end{cases}$$

For all $X_j$ declared missing, an adjusted frequency is not printed.
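
A small sketch of the frequency, relative percentage, and adjusted percentage calculations follows. The function name, the tuple layout of the result, and the use of `None` to stand for "not printed" are illustrative assumptions rather than part of the procedure.

```python
def frequency_table(values, weights, missing=()):
    """Rows of (value, f_j, relative %, adjusted % or None), sorted ascending."""
    freq = {}
    for x, w in zip(values, weights):
        freq[x] = freq.get(x, 0.0) + w
    total = sum(freq.values())  # all categories, missing included
    valid_total = sum(f for x, f in freq.items() if x not in missing)
    table = []
    for x in sorted(freq):
        f = freq[x]
        relative = 100.0 * f / total
        # No adjusted percentage is reported for values declared missing.
        adjusted = None if x in missing else 100.0 * f / valid_total
        table.append((x, f, relative, adjusted))
    return table
```

The relative percentages always sum to 100 over all categories, while the adjusted percentages sum to 100 over the nonmissing categories only.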


Cumulative Frequency (Percentage) 


Minimum

$$\min_k X_k$$

Maximum

$$\max_k X_k$$


Mode 


Value of Xj which has the largest observed frequency. If several are tied, the smallest value 
is selected. 


Range 


Maximum — Minimum 


The pth percentile 


Find the first score interval (x2) containing more than tp cases. 

rg if tp — cp, > 100/W 
th percentile — eM i 1)p/100— ceil} gc 0 /W 
p + ((W+1)p/100  cey]zr2 if (P — epi < 100/M 


where 


$t_p = (W + 1)p/100$

$cp_1 < t_p < cp_2$

$x_1$ and $x_2$ are the values corresponding to $cp_1$ and $cp_2$, respectively
$cc_1$ is the cumulative frequency up to $x_1$

$cp_1$ is the cumulative percent up to $x_1$


Note: when p=50, this is the median. 


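Mean

$$\bar X = \frac{\sum_{j=1}^{NV}f_j X_j}{W}$$

Moments about the mean are calculated as:

$$M_r = \sum_{i=1}^{NV}f_i\left(X_i - \bar X\right)^r, \quad r = 2, 3, 4$$

Variance

$$S^2 = \frac{M_2}{W - 1}$$

Standard Deviation

$$S = \sqrt{S^2}$$

Standard Error of the Mean

$$SEM = \frac{S}{\sqrt{W}}$$

Skewness (Bliss, 1967, p. 144)

$$g_1 = \frac{W M_3}{(W-1)(W-2)S^3}, \qquad se(g_1) = \sqrt{\frac{6W(W-1)}{(W-2)(W+1)(W+3)}}$$

The skewness is computed only if $W \ge 3$ and Variance > 0.

Kurtosis

$$g_2 = \frac{W(W+1)M_4 - 3(W-1)M_2^2}{(W-1)(W-2)(W-3)S^4}, \qquad se(g_2) = \sqrt{\frac{4(W^2-1)\,se(g_1)^2}{(W-3)(W+5)}}$$

The kurtosis is computed only if $W \ge 4$ and Variance > 0.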
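
The moment-based statistics can be checked with a short sketch. It takes the distinct values and their weighted frequencies as inputs; the function name and the dictionary keys are illustrative assumptions, and the sketch presumes W > 1 so that the variance is defined.

```python
import math

def moment_statistics(values, freqs):
    """Mean, variance, SEM, skewness and kurtosis from weighted frequencies."""
    W = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / W
    M2, M3, M4 = (sum(f * (x - mean) ** r for x, f in zip(values, freqs))
                  for r in (2, 3, 4))
    var = M2 / (W - 1)
    sd = math.sqrt(var)
    out = {"mean": mean, "variance": var, "sd": sd, "sem": sd / math.sqrt(W)}
    if W >= 3 and var > 0:
        out["g1"] = W * M3 / ((W - 1) * (W - 2) * sd ** 3)
        out["se_g1"] = math.sqrt(6.0 * W * (W - 1) / ((W - 2) * (W + 1) * (W + 3)))
    if W >= 4 and var > 0:
        out["g2"] = ((W * (W + 1) * M4 - 3.0 * (W - 1) * M2 * M2)
                     / ((W - 1) * (W - 2) * (W - 3) * var ** 2))
        out["se_g2"] = math.sqrt(4.0 * (W * W - 1) * out["se_g1"] ** 2
                                 / ((W - 3) * (W + 5)))
    return out
```

Skewness and kurtosis are simply omitted from the result when the conditions on W and the variance are not met, mirroring the "computed only if" rules above.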
Generalized linear mixed models algorithms


Generalized linear mixed models extend the linear model so that:
■ The target is linearly related to the factors and covariates via a specified link function.
■ The target can have a non-normal distribution.
■ The observations can be correlated.


Generalized linear mixed models cover a wide variety of models, from simple linear regression to 
complex multilevel models for non-normal longitudinal data. 


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


n           Number of complete cases in the dataset. It is an integer and $n \ge 1$.
p           Number of parameters (including the constant, if it exists) in the model. It is an integer and $p \ge 1$.
$p_x$       Number of non-redundant columns in the design matrix of fixed effects. It is an integer and $p_x \ge 1$.
K           Number of random effects.
y           n×1 target vector. The rows are records.
r           n×1 events vector for the binomial distribution representing the number of “successes” within a number of trials. All elements are non-negative integers.
m           n×1 trials vector for the binomial distribution. All elements are positive integers and $m_i \ge r_i$, i = 1, ..., n.
$\mu$       n×1 expected target value vector.
$\eta$      n×1 linear predictor vector.
X           n×p design matrix. The rows represent the records and the columns represent the parameters. The ith row is $x_i^T = (x_{i1}, \dots, x_{ip})$, where the superscript T means transpose of a matrix or vector, i = 1, ..., n, with $x_{i1} = 1$ if the model has an intercept.
Z           n×r design matrix of random effects.
O           n×1 offset vector. This can’t be the target or one of the predictors. Also this can’t be a categorical field.
$\beta$     p×1 parameter vector. The first element is the intercept, if there is one.
$\gamma$    r×1 random effect vector.
$\omega$    n×1 scale weight vector. If an element is less than or equal to 0 or missing, the corresponding record is not used.
f           n×1 frequency weight vector. Non-integer elements are treated by rounding the value to the nearest integer. For values less than 0.5 or missing, the corresponding records are not used.
N           Effective sample size, $N = \sum_{i=1}^{n}f_i$. If frequency weights are not used, N = n.
$\theta_k$  covariance parameters of the kth random effect
$\theta_G$  covariance parameters of the random effects, $\theta_G = \left[\theta_1^T, \dots, \theta_K^T\right]^T$
$\theta_R$  covariance parameters of the residuals
$\theta$    $\theta = \left[\theta_G^T, \theta_R^T\right]^T = \left[\theta_1^T, \dots, \theta_K^T, \theta_R^T\right]^T$
$V_{y|\gamma}$  Covariance matrix of y, conditional on the random effects


Model 


The form of a generalized linear mixed model for the target y with the random effects $\gamma$ is

$$\eta = g\left(E(y|\gamma)\right) = X\beta + Z\gamma + O, \qquad y|\gamma \sim F$$

where $\eta$ is the linear predictor; g(.) is the monotonic differentiable link function; $\gamma$ is a (r×1)
vector of random effects which are assumed to be normally distributed with mean 0 and variance
matrix G; X is a (n×p) design matrix for the fixed effects; Z is a (n×r) design matrix for the
random effects; O is an offset with a constant coefficient of 1 for each observation; F is the
conditional target probability distribution. Note that if there are no random effects, the model
reduces to a generalized linear model (GZLM).


The probability distributions without random effects offered (except multinomial) are listed 
in Table 45-1. The link functions offered are listed in Table 45-3. Different combinations of 
probability distribution and link function can result in different models. 


See “Nominal multinomial distribution” for more information on the nominal multinomial 
distribution. 


See “Ordinal multinomial distribution” for more information on the ordinal multinomial 
distribution. 


Note that the available distributions depend on the measurement level of the target: 


■ A continuous target can have any distribution except multinomial. The binomial
distribution is allowed because the target could be an “events” field. The default
distribution for a continuous target is the normal distribution.

■ A nominal target can have the multinomial or binomial distribution. The default is
multinomial.

■ An ordinal target can have the multinomial or binomial distribution. The default is
multinomial.

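Table 45-1
Distribution, range and variance of the response, variance function, and its first derivative

Distribution | Range of y | $V(\mu)$ | Var(y) | $V'(\mu)$
Normal | $(-\infty, \infty)$ | 1 | $\phi$ | 0
Inverse Gaussian | $(0, \infty)$ | $\mu^3$ | $\phi\mu^3$ | $3\mu^2$
Gamma | $(0, \infty)$ | $\mu^2$ | $\phi\mu^2$ | $2\mu$
Negative binomial | $0(1)\infty$ | $\mu + k\mu^2$ | $\mu + k\mu^2$ | $1 + 2k\mu$
Poisson | $0(1)\infty$ | $\mu$ | $\mu$ | 1
Binomial(m) | $0(1)m/m$ | $\mu(1-\mu)$ | $\mu(1-\mu)/m$ | $1 - 2\mu$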
Notes
■ 0(1)z means the range is from 0 to z with increments of 1; that is, 0, 1, 2, ..., z.

■ For the binomial distribution, the binomial trial variable m is considered as a part of the
weight variable $\omega$.

■ If a scale weight variable $\omega$ is presented, $\phi$ is replaced by $\phi/\omega$.

■ For the negative binomial distribution, the ancillary parameter (k) is estimated by the
maximum likelihood (ML) method. When k = 0, the negative binomial distribution reduces to
the Poisson distribution. When k = 1, the negative binomial is the geometric distribution.
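
The variance functions and their first derivatives from Table 45-1 are simple enough to tabulate in code. The sketch below is illustrative; the function names and the string keys used to select a distribution are assumptions, not identifiers from the procedure.

```python
def variance_function(dist, mu, k=1.0):
    """V(mu) for the distributions of Table 45-1 (k: negative binomial only)."""
    table = {
        "normal": 1.0,
        "inverse_gaussian": mu ** 3,
        "gamma": mu ** 2,
        "negative_binomial": mu + k * mu ** 2,
        "poisson": mu,
        "binomial": mu * (1.0 - mu),
    }
    return table[dist]

def variance_function_derivative(dist, mu, k=1.0):
    """V'(mu), the first derivative of the variance function."""
    table = {
        "normal": 0.0,
        "inverse_gaussian": 3.0 * mu ** 2,
        "gamma": 2.0 * mu,
        "negative_binomial": 1.0 + 2.0 * k * mu,
        "poisson": 1.0,
        "binomial": 1.0 - 2.0 * mu,
    }
    return table[dist]
```

Setting k = 0 in the negative binomial entries reproduces the Poisson values, which matches the note that the negative binomial reduces to the Poisson in that case.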


The full log-likelihood function ($\ell$), which will be used as the objective function for parameter
estimation, is listed for each distribution in the following table. 


Table 45-2 
The log-likelihood function for probability distribution 


Distribution e 
Normal nm 
l=; Ly — ffm (2n)} 
i=1 
Inverse Gaussian its 
l=&+)>_ —F¢m(2n)} 
i=1 
Gamma 


=0.+ >. fi{—In(yi)} 
i=l 


Negative ; nee 
binomial L=ly tS G{- (Pw +) 
i=1 
Poisson is : 
l=, Sof PE = In (yit)} 
i=l ob ; 
Binomial(m) 


£=&4 2 fie {In um ) | where ( un } = aia 
i=1_— ri J ae 


ry 


The following tables list the form, inverse form, range of $\hat\mu$, and first and second derivatives
for each link function.


Table 45-3
Link function name, form, inverse of link function, and range of the predicted mean

Link function | $\eta = g(\mu)$ | Inverse $\mu = g^{-1}(\eta)$ | Range of $\hat\mu$
Identity | $\mu$ | $\eta$ | $\hat\mu \in R$
Log | $\ln(\mu)$ | $\exp(\eta)$ | $\hat\mu \ge 0$
Logit | $\ln\left(\frac{\mu}{1-\mu}\right)$ | $\frac{\exp(\eta)}{1+\exp(\eta)}$ | $\hat\mu \in [0, 1]$
Probit | $\Phi^{-1}(\mu)$, where $\Phi(\xi) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\xi}e^{-z^2/2}\,dz$ | $\Phi(\eta)$ | $\hat\mu \in [0, 1]$
Complementary log-log | $\ln(-\ln(1-\mu))$ | $1 - \exp(-\exp(\eta))$ | $\hat\mu \in [0, 1]$
Power($\alpha$) | $\mu^\alpha$ if $\alpha \ne 0$; $\ln(\mu)$ if $\alpha = 0$ | $\eta^{1/\alpha}$ if $\alpha \ne 0$; $\exp(\eta)$ if $\alpha = 0$ | $\hat\mu \in R$ if $\alpha$ or $1/\alpha$ is odd integer; $\hat\mu \ge 0$ otherwise
Log-complement | $\ln(1-\mu)$ | $1 - \exp(\eta)$ | $\hat\mu \le 1$
Negative log-log | $-\ln(-\ln(\mu))$ | $\exp(-\exp(-\eta))$ | $\hat\mu \in [0, 1]$


Note: In the power link function, if $|\alpha| < 2.2\text{e-}16$, $\alpha$ is treated as 0.
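
The link functions and their inverses from Table 45-3 can be expressed as pairs of callables. This sketch is an illustration under stated assumptions: `make_link` and the string keys are hypothetical names, the probit link uses the standard library's `statistics.NormalDist` (Python 3.8 or later), and no range checking of μ is attempted.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def make_link(name, alpha=1.0):
    """Return (g, g_inverse) for the named link; alpha applies to 'power'."""
    # Per the note in the text, a power link with |alpha| < 2.2e-16 acts as log.
    if name == "power" and abs(alpha) < 2.2e-16:
        name = "log"
    links = {
        "identity": (lambda mu: mu, lambda eta: eta),
        "log": (math.log, math.exp),
        "logit": (lambda mu: math.log(mu / (1.0 - mu)),
                  lambda eta: math.exp(eta) / (1.0 + math.exp(eta))),
        "probit": (_STD_NORMAL.inv_cdf, _STD_NORMAL.cdf),
        "cloglog": (lambda mu: math.log(-math.log(1.0 - mu)),
                    lambda eta: 1.0 - math.exp(-math.exp(eta))),
        "power": (lambda mu: mu ** alpha, lambda eta: eta ** (1.0 / alpha)),
        "logc": (lambda mu: math.log(1.0 - mu),
                 lambda eta: 1.0 - math.exp(eta)),
        "nloglog": (lambda mu: -math.log(-math.log(mu)),
                    lambda eta: math.exp(-math.exp(-eta))),
    }
    return links[name]
```

A quick sanity check for any link is the round trip `g_inverse(g(mu))`, which should return `mu` for values inside the link's valid range.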


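Table 45-4
The first and second derivatives of link function

Link function | First derivative $g'(\mu) = \frac{\partial\eta}{\partial\mu} = \Delta$ | Second derivative $g''(\mu) = \frac{\partial^2\eta}{\partial\mu^2}$
Identity | 1 | 0
Log | $\frac{1}{\mu}$ | $-\Delta^2$
Logit | $\frac{1}{\mu(1-\mu)}$ | $\Delta^2(2\mu - 1)$
Probit | $\frac{1}{\phi(\Phi^{-1}(\mu))}$, where $\phi(z) = \frac{1}{\sqrt{2\pi}}e^{-z^2/2}$ | $\Delta^2\,\Phi^{-1}(\mu)$
Complementary log-log | $\frac{1}{(\mu-1)\ln(1-\mu)}$ | $-\Delta^2(1 + \ln(1-\mu))$
Power($\alpha$) | $\alpha\mu^{\alpha-1}$ if $\alpha \ne 0$; $\frac{1}{\mu}$ if $\alpha = 0$ | $\Delta\frac{\alpha-1}{\mu}$ if $\alpha \ne 0$; $-\Delta^2$ if $\alpha = 0$
Log-complement | $\frac{-1}{1-\mu}$ | $-\Delta^2$
Negative log-log | $\frac{-1}{\mu\ln(\mu)}$ | $\Delta^2(1 + \ln(\mu))$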


When the canonical parameter is equal to the linear predictor, $\theta = \eta$, then the link function is
called the canonical link function. Although the canonical links lead to desirable statistical 
properties of the model, particularly in small samples, there is in general no a priori reason why 
the systematic effects in a model should be additive on the scale given by that link. The canonical 
link functions for probability distributions are given in the following table. 


Table 45-5 
Canonical and default link functions for probability distributions 


Distribution Canonical link function 
Normal Identity 

Inverse Gaussian Power(—2) 

Gamma Power(—1) 

Negative binomial Negative binomial 
Poisson Log 

Binomial Logit 


The variance of y, conditional on the random effects, is

$$\mathrm{var}(y|\gamma) = A^{1/2}RA^{1/2}$$

The matrix A is a diagonal matrix and contains the variance function of the model, which
is the function of the mean $\mu$, divided by the corresponding scale weight variable; that is,
$A = \mathrm{diag}\left(V(\mu_i)/\omega_i\right)$, i = 1, ..., n. The variance functions, $V(\mu)$, are different for different
distributions. The matrix R is the variance matrix for repeated measures.


Generalized linear mixed models allow correlation and/or heterogeneity from random effects
(G-side) and/or heterogeneity from residual effects (R-side), resulting in 4 types of models:


1. If a GLMM has no G-side or R-side effects, then it reduces to a GZLM; G = 0 and $R = \phi I$, where I
is the identity matrix and $\phi$ is the scale parameter. For continuous distributions (normal, inverse
Gaussian and gamma), $\phi$ is an unknown parameter and is estimated jointly with the regression
parameters by the maximum likelihood (ML) method. For discrete distributions (negative
binomial, Poisson, binomial and multinomial), $\phi$ is estimated by Pearson chi-square as follows:

$$\hat\phi = \frac{1}{N^*}\sum_{i=1}^{n}f_i\omega_i\frac{(y_i - \hat\mu_i)^2}{V(\hat\mu_i)}$$

where $N^* = N - p_x$ for the restricted maximum pseudo-likelihood (REPL) method.


2. If a model only has G-side random effects, then the G matrix is user-specified and $R = \phi I$. $\phi$ is
estimated jointly with the covariance parameters in G for continuous distributions and $\phi = 1$ for
discrete distributions.


3. If a model only has R-side residual effects, then G = 0 and the R matrix is user-specified. All
covariance parameters in R are estimated using the REPL method, defined in “Estimation”.


4. If a model has both G-side and R-side effects, all covariance parameters in G and R are jointly
estimated using the REPL method.


For the negative binomial distribution, there is the ancillary parameter k, which is first estimated 
by the ML method, ignoring random and residual effects, then fixed to that estimate while other 
regression and covariance parameters are estimated. 


Fixed effects transformation 


To improve numerical stability, the X matrix is transformed according to the following rules.


The ith row of X is $x_i^T = (x_{i1}, \dots, x_{ip})$, i = 1, ..., n, with $x_{i1} = 1$ if the model has an intercept.
Suppose $x_i^*$ is the transformation of $x_i$; then the jth entry of $x_i^*$ is defined as

$$x_{ij}^* = \frac{x_{ij} - c_j}{s_j}$$

where $c_j$ and $s_j$ are centering and scaling values for $x_{ij}$, respectively, for j = 1, ..., p, and choices
of $c_j$ and $s_j$ are listed as follows:


a. For a non-constant continuous predictor or a derived predictor which includes a continuous
predictor, if the model has an intercept, $c_1 = 0$ and $c_j = \bar x_j$, $j \ne 1$, where $\bar x_j$ is the sample
mean of the jth predictor, $\bar x_j = \frac{1}{N}\sum_{i=1}^{n}f_i x_{ij}$; and $s_1 = 1$ and $s_j = \sqrt{s_{x_j}^2}$, $j \ne 1$, where $\sqrt{s_{x_j}^2}$ is
the sample standard deviation of the jth predictor and $s_{x_j}^2 = \frac{1}{N-1}\sum_{i=1}^{n}f_i\left(x_{ij} - \bar x_j\right)^2$. Note
that the intercept column is not transformed. If the model has no intercept, $c_j = 0$ and
$s_j = \sqrt{s_{x_j}^2 + \bar x_j^2}$.

b. For a constant predictor $x_{ij} = a \ne 0, \forall i$, $c_j = 0$ and $s_j = a$; that is, scale it to 1.

c. For a dummy predictor that is derived from a factor or a factor interaction, $c_j = 0$ and $s_j = 1$;
that is, leave it unchanged.
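
Rule (a) for a continuous predictor can be sketched as follows. This is a hedged illustration, not the procedure's code: the function name is hypothetical, and the no-intercept scaling $s_j = \sqrt{s_{x_j}^2 + \bar x_j^2}$ is this sketch's reading of the rule and should be treated as an assumption.

```python
import math

def center_scale(column, freqs, has_intercept=True):
    """Return (transformed column, c_j, s_j) for one continuous predictor."""
    N = sum(freqs)
    mean = sum(f * x for x, f in zip(column, freqs)) / N
    var = sum(f * (x - mean) ** 2 for x, f in zip(column, freqs)) / (N - 1)
    if has_intercept:
        c, s = mean, math.sqrt(var)
    else:
        # Assumed reading of the no-intercept rule: no centering, and the
        # scale combines the sample variance with the squared mean.
        c, s = 0.0, math.sqrt(var + mean * mean)
    return [(x - c) / s for x in column], c, s
```

The intercept column, constant predictors, and dummy predictors would bypass this function and follow the simpler rules described for them.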


Estimation 


We estimate GLMMs using linearization-based methods, also called the pseudo-likelihood
approach (PL; Wolfinger and O’Connell (1993)), penalized quasi-likelihood (PQL; Breslow
and Clayton (1993)), or marginal quasi-likelihood (MQL; Goldstein (1991)). They are based on
the similar principle that the GLMM is approximated by an LMM so that well-established
estimation methods for LMMs can be applied. More specifically, the mean target function, that is,
the inverse link function, is approximated by a linear Taylor series expansion around the current
estimates of the fixed-effect regression coefficients and different solutions of random effects (0
is used for MQL and the empirical Bayes estimates are used for PQL). Applying this linear
approximation of the mean target leads to a linear mixed model for a transformation of the original
target. The parameters of this LMM can be estimated by the Newton-Raphson or Fisher scoring
technique, and the estimates are then used to update the linear approximation. The algorithm
iterates between these two steps until convergence. In general, the method is a doubly iterative process.
The outer iterations update the transformed target for an LMM and the inner iterations
estimate the parameters of the LMM.


It is well known that parameter estimation for an LMM can be based on maximum likelihood
(ML) or restricted (or residual) maximum likelihood (REML). Similarly, parameter estimation
for a GLMM in the inner iterations can be based on maximum pseudo-likelihood (PL) or restricted
maximum pseudo-likelihood (REPL).


Linear mixed pseudo model 


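Following Wolfinger and O’Connell (1993), a first-order Taylor series of $\mu$ in (1) about $\tilde\beta$ and
$\tilde\gamma$ yields

$$\mu \approx \tilde\mu + (g^{-1})'\left(X\tilde\beta + Z\tilde\gamma + O\right)\left[X(\beta - \tilde\beta) + Z(\gamma - \tilde\gamma)\right]$$

where $(g^{-1})'\left(X\tilde\beta + Z\tilde\gamma + O\right)$ is a diagonal matrix with elements consisting of evaluations of
the 1st derivative of $g^{-1}$. Since $(g^{-1})'\left(X\tilde\beta + Z\tilde\gamma + O\right) = \left(g'(\tilde\mu)\right)^{-1}$, this equation can be
rearranged as

$$g'(\tilde\mu)(\mu - \tilde\mu) + X\tilde\beta + Z\tilde\gamma \approx X\beta + Z\gamma$$

If we define a pseudo target variable as

$$v \equiv g'(\tilde\mu)(y - \tilde\mu) + X\tilde\beta + Z\tilde\gamma = g'(\tilde\mu)(y - \tilde\mu) + g(\tilde\mu) - O$$

then the conditional expectation and variance of v, based on $E(y|\gamma)$ and $\mathrm{var}(y|\gamma) = A^{1/2}RA^{1/2}$,
are

$$E(v|\gamma) = g'(\tilde\mu)(\mu - \tilde\mu) + X\tilde\beta + Z\tilde\gamma \approx X\beta + Z\gamma$$

$$\mathrm{var}(v|\gamma) = g'(\tilde\mu)A_{\tilde\mu}^{1/2}RA_{\tilde\mu}^{1/2}g'(\tilde\mu)$$

where $A_{\tilde\mu}^{1/2} = \mathrm{diag}\left[\left(V(\tilde\mu_i)/\omega_i\right)^{1/2}\right]$, i = 1, ..., n.
Furthermore, we also assume $v|\gamma$ is normally distributed. Then we consider the model of v

$$v = X\beta + Z\gamma + \varepsilon$$

as a weighted linear mixed model with fixed effects $\beta$, random effects $\gamma \sim N(0, G)$, error terms
$\varepsilon \sim N\left(0, g'(\tilde\mu)A_{\tilde\mu}^{1/2}RA_{\tilde\mu}^{1/2}g'(\tilde\mu)\right)$, because $\mathrm{var}(\varepsilon) = \mathrm{var}(v|\gamma)$, and diagonal weight matrix
$\tilde W = A_{\tilde\mu}^{-1}\left[g'(\tilde\mu)\right]^{-2}$. Note that the new target v (with O if an offset variable exists) is a Taylor
series approximation of the linked target g(y). The estimation method of unknown parameters
of $\beta$ and $\theta$, which contains all unknowns in G and R, for traditional linear mixed models can
be applied to this linear mixed pseudo model.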


The Gaussian log pseudo-likelihood (PL) and restricted log pseudo-likelihood (REPL), which
are expressed as the functions of covariance parameters in $\theta$, corresponding to the linear mixed
model for v are the following:

$$\ell(\theta; v) = -\frac{1}{2}\ln|V(\theta)| - \frac{1}{2}r(\theta)^T V(\theta)^{-1}r(\theta) - \frac{N}{2}\ln(2\pi)$$

$$\ell_R(\theta; v) = -\frac{1}{2}\ln|V(\theta)| - \frac{1}{2}r(\theta)^T V(\theta)^{-1}r(\theta) - \frac{1}{2}\ln\left|X^T V(\theta)^{-1}X\right| - \frac{N - p_x}{2}\ln(2\pi)$$

where

$V(\theta) = ZG(\theta)Z^T + \tilde W^{-1/2}R(\theta)\tilde W^{-1/2}$, $r(\theta) = v - X\left(X^T V(\theta)^{-1}X\right)^{-1}X^T V(\theta)^{-1}v = v - X\hat\beta$, N
denotes the effective sample size, and $p_x$ denotes the rank of the design matrix X or the number
of non-redundant parameters in X. Note that the regression parameters in $\beta$ are profiled from the
above equations because the estimation of $\beta$ can be obtained analytically. The covariance
parameters in $\theta$ are estimated by the Newton-Raphson or Fisher scoring algorithm. Following the
tradition in linear mixed models, the objective functions of minimization for estimating $\theta$ would
be $-2\ell(\theta; v)$ or $-2\ell_R(\theta; v)$. Upon obtaining $\hat\theta$, estimates for $\beta$ and $\gamma$ are computed as

$$\hat\beta = \left(X^T V(\hat\theta)^{-1}X\right)^{-1}X^T V(\hat\theta)^{-1}v$$

$$\hat\gamma = \hat G Z^T V(\hat\theta)^{-1}r(\hat\theta)$$

where $\hat\beta$ is the best linear unbiased estimator (BLUE) of $\beta$ and $\hat\gamma$ is the estimated best linear
unbiased predictor (BLUP) of $\gamma$ in the linear mixed pseudo model. With these statistics, v and
$\tilde W$ are recomputed based on $\tilde\mu$ and the objective function is minimized again to obtain updated
$\hat\theta$. Iteration between $-2\ell(\theta; v)$ and the above equation yields the PL estimation procedure and
between $-2\ell_R(\theta; v)$ and the above equation the REPL procedure.


There are two choices for $\tilde\gamma$ (the current estimates of $\gamma$):
1. $\hat\gamma$ for PQL; and
2. 0 for MQL.


On the other hand, $\hat\beta$ is always used as the current estimate of the fixed effects. Based on the two
objective functions (PL or REPL) and two choices of random effect estimates (PQL or MQL), 4
estimation methods can be implemented for GLMMs:


1. PL-PQL: pseudo-likelihood with $\tilde\gamma = \hat\gamma$;
2. PL-MQL: pseudo-likelihood with $\tilde\gamma = 0$;
3. REPL-PQL: residual pseudo-likelihood with $\tilde\gamma = \hat\gamma$;
4. REPL-MQL: residual pseudo-likelihood with $\tilde\gamma = 0$.

We use method 3, REPL-PQL.
Iterative process 


The doubly iterative process for the estimation of 0 is as follows: 


1. Obtain an initial estimate of $\mu$, $\tilde\mu^{(0)}$. Specifically, $\tilde\mu_i^{(0)} = (y_i m_i + 0.5)/(m_i + 1)$ for a binomial
distribution ($y_i$ can be a proportion or 0/1 value) and $\tilde\mu_i^{(0)} = y_i$ for a non-binomial distribution.
Also set the outer iteration index j = 0.


2. Based on $\tilde\mu$, compute

$$v = g(\tilde\mu) - O + g'(\tilde\mu)(y - \tilde\mu) \quad \text{and} \quad \tilde W = A_{\tilde\mu}^{-1}\left[g'(\tilde\mu)\right]^{-2}$$

Fit a weighted linear mixed model with pseudo target v, fixed effects design matrix X, random
effects design matrix Z, and diagonal weight matrix $\tilde W$. The fitting procedure, which is called
the inner iteration, yields the estimates of $\theta$, and is denoted as $\hat\theta^{(j)}$. The procedure uses the
specified settings for parameter, log-likelihood, and Hessian convergence criteria for determining
convergence of the linear mixed model. If j = 0, go to step 4; otherwise go to the next step. See
“MIXED Algorithms” for more information on fitting the linear mixed model.


3. Check if the following criterion with tolerance level is satisfied: 


|oF'_g(F-)) 
max; { 2 x r = | r — | <8. 


If it is met or maximum number of outer iterations is reached, stop. Otherwise, go to the next step. 


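4. Compute $\hat\beta$ by setting $\theta = \hat\theta^{(j)}$, then set $\tilde\beta = \hat\beta$. Depending on the choice of random effect
estimates, set $\tilde\gamma = \hat\gamma$.


5. Compute the new estimate of $\mu$ by

$$\tilde\mu = g^{-1}\left(X\tilde\beta + Z\tilde\gamma + O\right),$$

set j = j + 1 and go to step 2.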


Wald confidence intervals for covariance parameter estimates 


Here we assume that the estimated parameters of G and R are obtained through the above doubly
iterative process. Then their asymptotic covariance matrix can be approximated by $2H^{-1}$, where
H is the Hessian matrix of the objective function ($-2\ell(\theta; v)$ or $-2\ell_R(\theta; v)$) evaluated at $\hat\theta$. The
standard error for the ith covariance parameter estimate in the $\theta$ vector, say $\hat\theta_i$, is the square root of
the ith diagonal element of $2H^{-1}$.


Thus, a simple Wald-type confidence interval or test statistic for any covariance parameter
can be obtained by using the asymptotic normality. However, these can be unreliable in small
samples, especially for variance and correlation parameters that have a range of $[0, \infty)$ and
$[-1, 1]$ respectively. Therefore, following the same method used in linear mixed models, these
parameters are transformed to parameters that have range $(-\infty, \infty)$. Using the delta method, these
transformed estimates still have asymptotic normal distributions.


For variance type parameters in G and R, such as $\sigma^2$ in the autoregressive, autoregressive moving
average, compound symmetry, diagonal, Toeplitz, and variance components, and $\theta_{ii}$ in the
unstructured type, the $100(1-\alpha)\%$ Wald confidence interval is given, assuming the variance
parameter estimate is $\hat\sigma^2$ and its standard error is $se(\hat\sigma^2)$ from the corresponding diagonal element
of $2H^{-1}$, by


$$\exp\left( \ln(\hat\sigma^2) \pm z_{1-\alpha/2} \times \frac{se(\hat\sigma^2)}{\hat\sigma^2} \right)$$
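
As an illustration of the log-scale interval for a variance parameter, the following sketch computes the limits from an estimate and its standard error. The function name `variance_wald_ci` and its arguments are hypothetical; the delta-method standard error of the log estimate is taken as the standard error divided by the estimate.

```python
import math
from statistics import NormalDist


def variance_wald_ci(var_hat, se, alpha=0.05):
    """Wald interval for a variance parameter, built on the log scale."""
    if var_hat <= 0.0:
        raise ValueError("var_hat must be positive")
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # z_{1 - alpha/2}
    center = math.log(var_hat)
    half_width = z * se / var_hat  # delta-method standard error of ln(var_hat)
    return math.exp(center - half_width), math.exp(center + half_width)
```

Because the interval is symmetric on the log scale, both limits are always positive and the estimate is their geometric mean.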


For correlation type parameters in G and R, such as $\rho$ in the autoregressive, autoregressive moving
average, and Toeplitz types and $\phi$ in the autoregressive moving average type, which usually come
with the constraint of $|\rho| < 1$, the $100(1-\alpha)\%$ Wald confidence interval is given, assuming the
correlation parameter estimate is $\hat\rho$ and its standard error is $se(\hat\rho)$ from the corresponding diagonal
element of $2H^{-1}$, by


$$\tanh\left( \tanh^{-1}(\hat\rho) \pm z_{1-\alpha/2} \times \frac{se(\hat\rho)}{1 - \hat\rho^2} \right)$$


where $\tanh(x) = \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)}$ and $\tanh^{-1}(x) = \frac{1}{2}\ln\left(\frac{1+x}{1-x}\right)$ are hyperbolic tangent and inverse
hyperbolic tangent, respectively.
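
The interval for a correlation parameter can be sketched in the same way. The function name `correlation_wald_ci` is hypothetical; the delta-method standard error on the inverse hyperbolic tangent scale is taken as the standard error divided by one minus the squared estimate.

```python
import math
from statistics import NormalDist


def correlation_wald_ci(rho_hat, se, alpha=0.05):
    """Wald interval for a correlation parameter via the atanh transformation."""
    if not -1.0 < rho_hat < 1.0:
        raise ValueError("rho_hat must lie strictly between -1 and 1")
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # z_{1 - alpha/2}
    center = math.atanh(rho_hat)
    half_width = z * se / (1.0 - rho_hat ** 2)
    return math.tanh(center - half_width), math.tanh(center + half_width)
```

Back-transforming with the hyperbolic tangent keeps both limits inside the admissible range of a correlation.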


For general type parameters, other than variance and correlation types, in G and R, such as $\sigma_1$ in
the compound symmetry type and $\theta_{ij}$, $i \neq j$, (off-diagonal elements) in the unstructured type, no
transformation is done. Then the $100(1-\alpha)\%$ Wald confidence interval is simply, assuming the
parameter estimate is $\hat\sigma_1$ and its standard error is $se(\hat\sigma_1)$ from the corresponding diagonal element
of $2H^{-1}$,


$$\left( \hat\sigma_1 - z_{1-\alpha/2} \times se(\hat\sigma_1),\ \hat\sigma_1 + z_{1-\alpha/2} \times se(\hat\sigma_1) \right)$$


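The $100(1-\alpha)\%$ Wald confidence interval for $\phi$ is

$$\left( \exp(\hat\tau - z_{1-\alpha/2}\hat\sigma_\tau),\ \exp(\hat\tau + z_{1-\alpha/2}\hat\sigma_\tau) \right)$$

where $\hat\tau = \ln(\hat\phi)$.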

Note that the z-statistics for the hypothesis $H_{0i}: \theta_i = 0$, where $\theta_i$ is a covariance parameter in
the $\theta$ vector, are calculated; however, the Wald tests should be considered as an approximation and
used with caution because the test statistics might not have a standard normal distribution.


Statistics for estimates of fixed and random effects 


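The approximate covariance matrix of $(\hat\beta - \beta, \hat\gamma - \gamma)$ is


$$\begin{bmatrix} X^T R^{*-1} X & X^T R^{*-1} Z \\ Z^T R^{*-1} X & Z^T R^{*-1} Z + G^{-1} \end{bmatrix}^{-1} = \begin{bmatrix} C_{11} & C_{21}^T \\ C_{21} & C_{22} \end{bmatrix}$$


where $R^* = \mathrm{var}(v|\gamma) = g'(\hat\mu) A^{1/2} R A^{1/2} g'(\hat\mu)$ is evaluated at the converged estimates and


$$C_{11} = (X^T V^{-1} X)^{-}$$
$$C_{21} = -G Z^T V^{-1} X C_{11}$$
$$C_{22} = (Z^T R^{*-1} Z + G^{-1})^{-1} - C_{21} X^T V^{-1} Z G$$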


Statistics for estimates of fixed effects on original scale 


If the X matrix is transformed, the restricted log pseudo-likelihood (REPL) differs between the
transformed and original scales, so the REPL on the transformed scale should be
transformed back on the final iteration so that any post-estimation statistics based on REPL can
be calculated correctly. Suppose the final objective function values based on the transformed and
original scales are $-2\ell_R^*(\theta; v)$ and $-2\ell_R(\theta; v)$, respectively; then $-2\ell_R(\theta; v)$ can be obtained
from $-2\ell_R^*(\theta; v)$ as follows:


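$$-2\ell_R(\theta; v) = -2\ell_R^*(\theta; v) - 2\ln|A|$$


Because REPL has the following extra term involving the X matrix,


$$-\tfrac{1}{2}\ln\left|X^{*T} V(\theta)^{-1} X^*\right| = -\tfrac{1}{2}\ln\left|(XA)^T V(\theta)^{-1} XA\right|$$
$$= -\tfrac{1}{2}\left( \ln\left|X^T V(\theta)^{-1} X\right| + \ln|A| + \ln\left|A^T\right| \right)$$
$$= -\tfrac{1}{2}\ln\left|X^T V(\theta)^{-1} X\right| - \ln|A|,$$


then $-\tfrac{1}{2}\ln|X^T V(\theta)^{-1} X| = -\tfrac{1}{2}\ln|X^{*T} V(\theta)^{-1} X^*| + \ln|A|$ and $\ell_R(\theta; v) = \ell_R^*(\theta; v) + \ln|A|$. Please
note that PL values are the same whether the X matrix is transformed or not.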


In addition, the final estimates of $\beta$, $C_{11}$, $C_{21}$ and $C_{22}$ are based on the transformed scale, denoted
as $\hat\beta^*$, $C_{11}^*$, $C_{21}^*$ and $C_{22}^*$, respectively. They are transformed back to the original scale, denoted as
$\hat\beta$, $C_{11}$, $C_{21}$ and $C_{22}$, respectively, as follows:


$$\hat\beta = A\hat\beta^*,$$
$$C_{11} = A C_{11}^* A^T,$$
$$C_{21} = C_{21}^* A^T,$$
$$C_{22} = C_{22}^*.$$


Note that A could reduce to $S^{-1}$; hereafter, the superscript * denotes a quantity on the transformed
scale.


Estimated covariance matrix of the fixed effects parameters 


Two estimated covariance matrices of the fixed effects parameters can be calculated: model-based 
and robust. 


The model-based estimated covariance matrix of the fixed effects parameters is given by

$$\Sigma_m = C_{11}$$


The robust estimated covariance matrix of the fixed effects parameters for a GLMM is defined as
the classical sandwich estimator. It is similar to that for a generalized linear model or a generalized
estimating equation (GEE). If the model is a generalized linear mixed model and it is processed by
subjects, then the robust estimator is defined as follows

$$\Sigma_r = \Sigma_m \left( \sum_{i=1}^{S} X_i^T V_i^{-1} r_i r_i^T V_i^{-1} X_i \right) \Sigma_m$$


where $r_i = v_i - X_i\hat\beta$.


Standard errors for estimates in fixed effects and predictions in random effects 


Let $\hat\beta_i$ denote a non-redundant parameter estimate in fixed effects. Its standard error is the square
root of the ith diagonal element of $\Sigma_m$ or $\Sigma_r$,


$$\hat\sigma_{\beta_i} = \sqrt{\Sigma_{ii}}$$

The standard error for redundant parameter estimates is set to a system missing value.


Let $\hat\gamma_i$ denote a prediction in random effects. Its standard error is the square root of the ith
diagonal element of $C_{22}$:


$$\hat\sigma_{\gamma_i} = \sqrt{C_{22,ii}}$$


Test statistics for estimates in fixed effects and predictions in random effects 


The hypothesis $H_{0i}: \beta_i = 0$ is tested for each non-redundant parameter in fixed effects using the
t statistic:

$$t = \frac{\hat\beta_i}{\hat\sigma_{\beta_i}}$$

which has an asymptotic t distribution with $\nu$ degrees of freedom. See “Method for computing
degrees of freedom” for details on computing the degrees of freedom.


Wald confidence intervals for estimates in fixed effects and predictions in random 
effects 


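The $100(1-\alpha)\%$ Wald confidence interval for $\beta_i$ is given by


$$\left( \hat\beta_i - t_{\nu,\alpha/2}\,\hat\sigma_{\beta_i},\ \hat\beta_i + t_{\nu,\alpha/2}\,\hat\sigma_{\beta_i} \right)$$

where $t_{\nu,\alpha/2}$ is the $(1-\alpha/2)$100th percentile of the $t_\nu$ distribution.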


For some models (see the list below), the exponentiated parameter estimates, their standard
errors, and confidence intervals are computed. Using the delta method, the estimate of $\exp(\beta_i)$ is
$\exp(\hat\beta_i)$, the standard error estimate is $\exp(\hat\beta_i) \cdot \hat\sigma_{\beta_i}$ and the corresponding
$100(1-\alpha)\%$ Wald confidence interval for $\exp(\beta_i)$ is


$$\left( \exp(\hat\beta_i - t_{\nu,\alpha/2}\,\hat\sigma_{\beta_i}),\ \exp(\hat\beta_i + t_{\nu,\alpha/2}\,\hat\sigma_{\beta_i}) \right).$$

The list of models is as follows:
1. Logistic regression (binomial distribution + logit link). 
2. Nominal logistic regression (nominal multinomial distribution + generalized logit link). 
3. Ordinal logistic regression (ordinal multinomial distribution + cumulative logit link). 
4. Log-linear model (Poisson distribution + log link). 


5. Negative binomial regression (negative binomial distribution + log link). 
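
For the models listed above, the exponentiated quantities can be sketched as follows. The function name `exponentiated_estimate` is hypothetical, and the t critical value is passed in as `t_crit` rather than computed, since it depends on the degrees of freedom method in use.

```python
import math


def exponentiated_estimate(beta_hat, se_beta, t_crit):
    """Return exp(beta_hat), its delta-method standard error, and Wald limits."""
    estimate = math.exp(beta_hat)
    std_error = estimate * se_beta  # delta method: exp(beta_hat) * se(beta_hat)
    lower = math.exp(beta_hat - t_crit * se_beta)
    upper = math.exp(beta_hat + t_crit * se_beta)
    return estimate, std_error, (lower, upper)
```

The limits are obtained by exponentiating the interval for the coefficient itself, so they are not symmetric about the exponentiated estimate.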


Testing


After estimating parameters and calculating relevant statistics, several tests for the given model
are performed.


Goodness of fit 


Information criteria 


Information criteria are used when comparing different models for the same data. The formulas 
for various criteria are as follows. 


Finite sample corrected (AICC) $-2\ell + \frac{2dN}{N-d-1}$


Bayesian information criteria (BIC) $-2\ell + d\ln(N)$


where $\ell$ is the restricted log-pseudo-likelihood evaluated at the parameter estimates. For REPL,
N is the effective sample size minus the number of non-redundant parameters in fixed effects
($\sum_{i=1}^{n} f_i - p_x$) and d is the number of covariance parameters.
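
A short sketch of the two criteria shown above follows. It assumes the usual finite-sample correction term $2dN/(N-d-1)$ for AICC; the function name `information_criteria` and its argument names are hypothetical.

```python
import math


def information_criteria(log_pl, d, n_eff):
    """AICC and BIC from a (restricted) log pseudo-likelihood value.

    log_pl : log pseudo-likelihood at the estimates
    d      : number of covariance parameters
    n_eff  : effective sample size (already reduced by p_x for REPL)
    """
    if n_eff - d - 1 <= 0:
        raise ValueError("AICC requires n_eff > d + 1")
    minus_two_l = -2.0 * log_pl
    aicc = minus_two_l + 2.0 * d * n_eff / (n_eff - d - 1.0)
    bic = minus_two_l + d * math.log(n_eff)
    return {"AICC": aicc, "BIC": bic}
```

Smaller values indicate a better trade-off between fit and complexity, subject to the caution about comparing linearized models.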


Note that the restricted log-pseudo-likelihood values are those of the linearized model, not of the
original scale. Thus the information criteria should not be compared across models with different
distributions and link functions, and they should be interpreted with caution.


Tests of fixed effects 


For each effect specified in the model, a type III test matrix L is constructed and $H_0: L\beta = 0$ is
tested. Construction of L and the generating estimable function (GEF) is based on the generating
matrix $H_\omega = (X^T \Psi X)^{-} X^T \Psi X$, where $\Psi = \mathrm{diag}(f_1\omega_1, \ldots, f_n\omega_n)$, such that $L\beta$ is estimable; that
is, $L = LH_\omega$. It involves parameters only for the given effect and the effects containing the given
effect. For type III analysis, L does not depend on the order of effects specified in the model. If
such a matrix cannot be constructed, the effect is not testable.


The L matrix is then used to construct the test statistic

$$F = \frac{\hat\beta^T L^T (L\Sigma L^T)^{-} L\hat\beta}{r_c}$$


where $r_c = \mathrm{rank}(L\Sigma L^T)$. The statistic has an approximate F distribution. The numerator
degrees of freedom is $r_c$ and the denominator degrees of freedom is $\nu$. See “Method for computing
degrees of freedom” for details on computing the denominator degrees of freedom.


In addition, we test a null hypothesis that all regression parameters (except intercept if there is 
one) equal zero. The test statistic would be the same as the above F statistic except the L matrix is 
from GEF. If there is no intercept, the L matrix is the whole GEF. If there is an intercept, the L 
matrix is GEF without the first row which corresponds to the intercept. This test is similar to the 
“corrected model” in linear models. 


Estimated marginal means 


There are two types of estimated marginal means calculated here. One corresponds to the 
specified factors for the linear predictor of the model and the other corresponds to those for the 
original scale of the target. 


Estimated marginal means are based on the estimated cell means. For a given fixed set of factors, 
or their interactions, we estimate marginal means as the mean value averaged over all cells 
generated by the rest of the factors in the model. Covariates may be fixed at any specified value. 
If not specified, the value for each covariate is set to its overall mean estimate. 


Estimated marginal means are not available for the multinomial distribution. 


Estimated marginal means for the linear predictor 


Calculating estimated marginal means for the linear predictor 


Estimated marginal means for the linear predictor are based on the link function transformation,
and constructed such that $L\beta$ is estimable.


Suppose there are r combined levels of the specified categorical effect. This r×1 vector can be
expressed in the form $\hat u = L\hat\beta$. The variance matrix of $\hat u$ is then computed by


$$V(\hat u) = L\Sigma L^T$$


The standard error for the jth element of $\hat u$ is the square root of the jth diagonal element of $V(\hat u)$.
Let the jth element of $\hat u$ and its standard error be $\hat u_j$ and $\hat\sigma_{u_j}$, respectively, then the corresponding
$100(1-\alpha)\%$ confidence interval for $u_j$ is given by


$$\hat u_j \pm t_{\nu_j,\alpha/2}\,\hat\sigma_{u_j}$$

where $t_{\nu_j,\alpha/2}$ is the $(1-\alpha/2)$100th percentile of the t distribution with $\nu_j$ degrees of freedom.
See “Method for computing degrees of freedom” for details on computing the degrees of
freedom.


Comparing estimated marginal means for the linear predictor 


We can compare estimated marginal means for the linear predictor based on a selected contrast
type, for which a set of contrasts for the factor is created. Let this set of contrasts define matrix
C used for testing the hypothesis $H_0: Cu = 0$. An F statistic is used for testing the given set of
contrasts for the factor as follows:


$$F = \frac{(C\hat u)^T \left( C V(\hat u) C^T \right)^{-} (C\hat u)}{r_c}$$


which has an asymptotic F distribution with $r_c$ numerator degrees of freedom, where $r_c = \mathrm{rank}(C V(\hat u) C^T)$.
See “Method for computing degrees of freedom” for details on computing the denominator
degrees of freedom. The p-values can be calculated accordingly. Note that adjusted p-values
based on multiple comparisons adjustments are not computed for the overall test.


Each row $c_i^T$ of matrix C is also tested separately. The estimate for the ith row is given by $c_i^T\hat u$ and
its standard error by $\hat\sigma_{c_i u} = \sqrt{c_i^T V(\hat u) c_i}$. The corresponding $100(1-\alpha)\%$ confidence interval is given by


$$c_i^T\hat u \pm t_{\nu_i,\alpha/2}\,\hat\sigma_{c_i u}$$


The test statistic for $H_{0i}: c_i^T u = 0$ is


$$t = \frac{c_i^T\hat u}{\hat\sigma_{c_i u}}$$


It has an asymptotic t distribution. See “Method for computing degrees of freedom” for details
on computing the degrees of freedom. The p-values can be calculated accordingly. In addition,
adjusted p-values for multiple comparisons can also be computed.


Estimated marginal means in the original scale 


Estimated marginal means for the target are based on the original scale. As a conditional predictor 
defined by Lane and Nelder (1982), estimated marginal means for the target are derived from 
those for the linear predictor. 


Calculating estimated marginal means for the target 


The estimated marginal means for the target are defined as


$$\hat M = g^{-1}(L\hat\beta) = g^{-1}(\hat u)$$

The variance of estimated marginal means for the target is


$$V(\hat M) = \mathrm{diag}\left( \frac{\partial g^{-1}(\hat u_j)}{\partial u_j} \right) L\Sigma L^T \, \mathrm{diag}\left( \frac{\partial g^{-1}(\hat u_j)}{\partial u_j} \right)$$


where $\mathrm{diag}(\partial g^{-1}(\hat u_j)/\partial u_j)$ is an r×r matrix and $\partial g^{-1}(\hat u_j)/\partial u_j$ is the derivative of the inverse of
the link with respect to the jth value in $\hat u$ and $\partial g^{-1}(\hat u_j)/\partial u_j = 1/g'(\hat M_j)$ where $g'(\hat M_j)$ is
from Table 45-4.


The $100(1-\alpha)\%$ confidence interval for $M_j$, $j = 1, \ldots, r$, is given by


Note: M is estimated marginal means for the proportion, not for the number of events when 
events and trials variables are used for the binomial distribution. 


Comparing estimated marginal means for the target 


This is similar to comparing estimated marginal means for the linear predictor; just replace $\hat u$ with
$\hat M$ and $V(\hat u)$ with $V(\hat M)$. For more information, see the topic “Estimated marginal means for the
linear predictor”.


Multiple comparisons 


The hypothesis $H_0: Cu = 0$ can be tested using the multiple row hypotheses testing technique.
Let $c_i^T$ be the ith row vector of matrix C. The ith row hypothesis is $H_{0i}: c_i^T u = 0$. Testing $H_0$ is the
same as testing multiple non-redundant row hypotheses $\{H_{0i}\}_{i=1}^{R}$ simultaneously, where R is the
number of non-redundant row hypotheses, and $H_{0i}$ represents the ith non-redundant hypothesis. A
hypothesis $H_{0i}$ is redundant if there exists another hypothesis $H_{0j}$, $j \neq i$, such that $c_i = a c_j$, $a \neq 0$.


Adjusted p-values. For each individual hypothesis $H_{0i}$, test statistics can be calculated.
Let $p_i$ denote the p-value for testing $H_{0i}$ and $p_i^*$ denote an adjusted p-value. The
conclusion from multiple testing is, at level $\alpha$ (the family-wise type I error),


reject $H_{0i}: c_i^T u = 0$ if $p_i^* < \alpha$;
reject $H_0: Cu = 0$ if $\min_i(p_i^*) < \alpha$.


Several different methods to adjust p-values are provided here. Please note that if the adjusted 
p-value is bigger than 1, it is set to 1 in all the methods. 


Adjusted confidence intervals. Note that if confidence intervals are also calculated for the
above hypothesis, then adjusting confidence intervals is required to correspond to adjusted
p-values. The only item needed to be adjusted in the confidence intervals is the critical value
from the standard normal distribution. Assume that the original critical value is $z_{1-\alpha/2}$ and the
adjusted critical value is $z^*$.


LSD (Least Significant Difference)

The adjusted p-values are the same as the original p-values:


$$p_i^* = p_i$$


The adjusted critical value is: 


Sequential Bonferroni 
The adjusted p-values are:


$$p_{(i)}^* = \begin{cases} R\,p_{(1)} & i = 1 \\ \max\left( (R-i+1)\,p_{(i)},\ p_{(i-1)}^* \right) & i \geq 2 \end{cases}$$


The adjusted critical values will correspond to the ordered adjusted p-values as follows: 
ty, 2 if¢=1 

if pi=(h —t+1)p ij) fori = 2 

if pP{;)P(;-1) for 1 2 2 


Sequential Sidak 


The adjusted p-values are:


$$p_{(i)}^* = \begin{cases} 1 - (1 - p_{(1)})^{R} & i = 1 \\ \max\left( 1 - (1 - p_{(i)})^{R-i+1},\ p_{(i-1)}^* \right) & i \geq 2 \end{cases}$$
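
The two step-down adjustments can be sketched together. This is a minimal illustration that assumes the p-values are ordered from smallest to largest before adjustment, as the subscript $(i)$ suggests; the function name `sequential_adjust` and the `method` argument are hypothetical.

```python
def sequential_adjust(p_values, method="bonferroni"):
    """Step-down adjusted p-values, returned in the original input order."""
    r = len(p_values)
    order = sorted(range(r), key=lambda k: p_values[k])  # indices by ascending p
    adjusted = [0.0] * r
    running_max = 0.0
    for rank, k in enumerate(order):
        multiplier = r - rank  # equals R - i + 1 for the i-th smallest p-value
        p = p_values[k]
        if method == "bonferroni":
            raw = multiplier * p
        elif method == "sidak":
            raw = 1.0 - (1.0 - p) ** multiplier
        else:
            raise ValueError("method must be 'bonferroni' or 'sidak'")
        running_max = max(raw, running_max)  # keeps adjusted values monotone
        adjusted[k] = min(1.0, running_max)  # values above 1 are set to 1
    return adjusted
```

For example, `sequential_adjust([0.01, 0.04, 0.03])` adjusts the smallest p-value by a factor of 3, the next by 2, and carries the running maximum forward to the largest.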


The adjusted critical values will correspond to the ordered adjusted p-values as follows: 


E (i) teen if7 = 1, 
: 2 : * ) A F, 
tty = t pte (=) if p/;) = (R—i+1)py) fori 2 2. 
a - * —)* 7, > 
ty ,tana if p/;) Pi; —1) fo1 122 
1 —p,; 
where x = n(-rii-)) 
In (1=p;)) 


Method for computing degrees of freedom 


Residual method 


The value of degrees of freedom is given by $N - \mathrm{rank}(X)$, where N is the effective sample size
and X is the design matrix of fixed effects.


Satterthwaite’s approximation


First perform the spectral decomposition $L\Sigma L^T = \Gamma^T D \Gamma$ where $\Gamma$ is an orthogonal matrix of
eigenvectors and D is a diagonal matrix of eigenvalues. If $l_m$ is the mth row of $\Gamma L$, $d_m$ is the
mth eigenvalue and


$$\nu_m = \frac{2 d_m^2}{g_m^T \Sigma_\theta\, g_m}$$

where $g_m = \left. \frac{\partial\, l_m \Sigma\, l_m^T}{\partial\theta} \right|_{\theta=\hat\theta}$ and $\Sigma_\theta$ is the asymptotic covariance matrix of $\hat\theta$ obtained from the
Hessian matrix of the objective function; that is, $\Sigma_\theta = 2H^{-1}$. If


$$E = \sum_{m=1}^{q} \frac{\nu_m}{\nu_m - 2}\, I(\nu_m > 2)$$

then the denominator degree of freedom is given by


$$\nu = \frac{2E}{E - q}$$


Note that the degrees of freedom can only be computed when E > q.
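
The final combination step of the approximation can be sketched as follows, taking the per-eigenvalue quantities $\nu_m$ as already computed. The function name `satterthwaite_denominator_df` is hypothetical, and returning `None` when E does not exceed q is an illustrative choice rather than documented behavior.

```python
def satterthwaite_denominator_df(nu_m):
    """Combine per-eigenvalue degrees of freedom nu_m into a denominator df."""
    q = len(nu_m)
    # Only terms with nu_m > 2 contribute to E, mirroring the indicator I(nu_m > 2).
    e = sum(v / (v - 2.0) for v in nu_m if v > 2.0)
    if e <= q:
        return None  # the degrees of freedom cannot be computed
    return 2.0 * e / (e - q)
```

With a single term (q = 1) the formula returns that term's own value, which is a quick sanity check on the algebra.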


Scoring 


For GLMMs, predicted values and relevant statistics can be computed based on solutions of
random effects. PQL-type predictions use $\hat\gamma$ as the solution for the random effects to compute
predicted values and relevant statistics.


PQL-type predicted values and relevant statistics 


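Predicted value of the linear predictor


$$\hat\eta_i = x_i^T\hat\beta + z_i^T\hat\gamma + o_i$$


Standard error of the linear predictor


$$\hat\sigma_{\eta_i} = \sqrt{x_i^T \Sigma\, x_i + z_i^T C_{22}\, z_i + 2 z_i^T C_{21}\, x_i}$$

Predicted value of the mean

$$\hat\mu_i = g^{-1}(x_i^T\hat\beta + z_i^T\hat\gamma + o_i)$$


For the binomial distribution with 0/1 binary target variable, the predicted category $c(x_i)$ is


$$c(x_i) = \begin{cases} 1 \text{ (or success)} & \text{if } \hat\mu_i \geq 0.5 \\ 0 \text{ (or failure)} & \text{otherwise} \end{cases}$$


Approximate $100(1-\alpha)\%$ confidence intervals for the mean

$$g^{-1}\left( x_i^T\hat\beta + z_i^T\hat\gamma + o_i \pm t_{\nu,\alpha/2}\,\hat\sigma_{\eta_i} \right)$$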


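Raw residual on the link function transformation


$$r_i^{v} = v_i - \hat\eta_i$$


Raw residual on the original scale of the target

$$r_i^{y} = y_i - \hat\mu_i$$

Pearson-type residual on the link function transformation


$$r_i^{Pv} = \frac{v_i - \hat\eta_i}{\sqrt{\mathrm{var}(v_i|\gamma)}}$$


where $\mathrm{var}(v_i|\gamma)$ is the ith diagonal element of $\mathrm{var}(v|\gamma)$ and $\mathrm{var}(v|\gamma) = g'(\hat\mu) A^{1/2} R A^{1/2} g'(\hat\mu)$, where
$\hat\mu$ is an n×1 vector of PQL-type predicted values of the mean.


Pearson-type residual on the original scale of the target


$$r_i^{Py} = \frac{y_i - \hat\mu_i}{\sqrt{\mathrm{var}(y_i|\gamma)}}$$


where $\mathrm{var}(y_i|\gamma)$ is the ith diagonal element of $\mathrm{var}(y|\gamma) = A^{1/2} R A^{1/2}$, evaluated at $\mu = \hat\mu$.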


Classification Table 


Suppose that $c(j, j')$ is the sum of the frequencies for the observations whose actual target
category is $j$ (as row) and predicted target category is $j'$ (as column), $j, j' = 1, \ldots, J$ (note that J =
2 for binomial), then


$$c(j, j') = \sum_{i=1}^{n} f_i\, I\left( y_i = j \text{ and } c(x_i) = j' \right)$$


where $I(\cdot)$ is the indicator function.


Suppose that $p(j, j')$ is the $(j, j')$th element of the classification table, which is the row
percentage, then

$$p(j, j') = \frac{c(j, j')}{\sum_{j'=1}^{J} c(j, j')} \times 100\%$$

The overall percentage of correct predictions is

$$p_{total} = \frac{\sum_{j=1}^{J} c(j, j)}{\sum_{j=1}^{J}\sum_{j'=1}^{J} c(j, j')} \times 100\%$$


Nominal multinomial distribution 


The nominal multinomial distribution requires some extra notation and explanation. 


Notation 


The following notation is used throughout this section unless otherwise stated: 


$S$  Number of super subjects.

$T_s$  Number of cases in the sth super subject.

$y_{st}$  Nominal categorical target for the tth case in the sth super subject. Its category values
are denoted as 1, 2, and so on.

$J$  The total number of categories for target.

$\mathbf{y}_{st}$  Dummy vector of $y_{st}$, $\mathbf{y}_{st} = (y_{st,1}, \ldots, y_{st,J-1})^T$, where $y_{st,j} = 1$ if $y_{st} = j$,
otherwise $y_{st,j} = 0$. The superscript T means the transpose of a matrix or vector.

$\mathbf{y}_s$  $\mathbf{y}_s = (\mathbf{y}_{s1}^T, \ldots, \mathbf{y}_{sT_s}^T)^T$, $s = 1, \ldots, S$.

$\mathbf{y}$  $\mathbf{y} = (\mathbf{y}_1^T, \ldots, \mathbf{y}_S^T)^T$.

$\pi_{st,j}$  Probability of category j for the tth case in the sth super subject; that is,
$\pi_{st,j} = P(y_{st} = j)$.

$\pi_{st}$  $\pi_{st} = (\pi_{st,1}, \ldots, \pi_{st,J-1})^T$.

$\pi_s$  $\pi_s = (\pi_{s1}^T, \ldots, \pi_{sT_s}^T)^T$, $s = 1, \ldots, S$.

$\pi$  $\pi = (\pi_1^T, \ldots, \pi_S^T)^T$.

$\eta_{st,j}$  Linear predictor value for category j of the tth case in the sth super subject.

$\eta_{st}$  $\eta_{st} = (\eta_{st,1}, \ldots, \eta_{st,J-1})^T$.

$\eta_s$  $\eta_s = (\eta_{s1}^T, \ldots, \eta_{sT_s}^T)^T$.

$\eta$  (n(J-1)) × 1 vector of linear predictor. $\eta = (\eta_1^T, \ldots, \eta_S^T)^T$.

$x_{st}$  p × 1 vector of predictor variables for the tth case in the sth super subject. The first
element is 1 if there is an intercept.

$X_s$  $X_s = (I_{J-1} \otimes x_{s1}, \ldots, I_{J-1} \otimes x_{sT_s})^T$, $s = 1, \ldots, S$.

$X$  (n(J-1)) × (J-1)p design matrix of fixed effects, $X = (X_1^T, \ldots, X_S^T)^T$.

$z_{st}$  r × 1 vector of coefficients for the random effect corresponding to the tth case in the
sth super subject.

$Z_s$  $Z_s = (I_{J-1} \otimes z_{s1}, \ldots, I_{J-1} \otimes z_{sT_s})^T$, $s = 1, \ldots, S$.

$Z$  Design matrix of random effects, $Z = \oplus_{s=1}^{S} Z_s$, where $\oplus$ is the direct sum of matrices.

$O$  n × 1 vector of offsets, $O = (o_{11}, \ldots, o_{1T_1}, \ldots, o_{ST_S})^T$, where $o_{st}$ is the offset value of
the tth case in the sth super subject. This can’t be the target (y) or one of the predictors
(X). The offset must be continuous.

$O^*$  $O^* = O \otimes 1_q$, where $1_q$ is a length q vector of 1’s.

$\beta_j$  p × 1 vector of unknown parameters for category j, $\beta_j = (\beta_{j1}, \ldots, \beta_{jp})^T$, $j = 1, \ldots, J-1$.
The first element in $\beta_j$ is the intercept for the category j, if there is one.

$\beta$  $\beta = (\beta_1^T, \ldots, \beta_{J-1}^T)^T$.

$\gamma_{sj}$  r × 1 vector of random effects for category j in the sth super subject, $j = 1, \ldots, J-1$.

$\gamma_s$  Random effects for the sth super subject, $\gamma_s = (\gamma_{s1}^T, \ldots, \gamma_{s(J-1)}^T)^T$.

$\omega_{st}$  Scale weight of the tth case in the sth super subject. It does not have to be integers. If
it is less than or equal to 0 or missing, the corresponding case is not used.

$\omega$  n × 1 vector of scale weight variable, $\omega = (\omega_{11}, \ldots, \omega_{1T_1}, \ldots, \omega_{S1}, \ldots, \omega_{ST_S})^T$.

$f_{st}$  Frequency weight of the tth case in the sth super subject. If it is a non-integer value, it
is treated by rounding the value to the nearest integer. If it is less than 0.5 or missing,
the corresponding cases are not used.

$f$  n × 1 vector of frequency count variable, $f = (f_{11}, \ldots, f_{1T_1}, \ldots, f_{S1}, \ldots, f_{ST_S})^T$.

$N$  Effective sample size, $N = \sum_{i=1}^{n} f_i$. If frequency count variable f is not used, N = n.


Model


The form of a generalized linear mixed model for nominal target with the random effects is


$$\eta = g(E(y|\gamma)) = X\beta + Z\gamma + O^*$$

where $\eta$ is the linear predictor; X is the design matrix for fixed effects; Z is the design matrix for
random effects; $\gamma$ is a vector of random effects which are assumed to be normally distributed with
mean 0 and variance matrix G; g(.) is the logit link function such that


$$\eta_{st,j} = g(\pi_{st,j}) = \log\left( \frac{\pi_{st,j}}{\pi_{st,J}} \right)$$


And its inverse function is


$$\pi_{st,j} = g^{-1}(\eta_{st,j}) = \begin{cases} \dfrac{\exp(\eta_{st,j})}{1 + \sum_{k=1}^{J-1} \exp(\eta_{st,k})}, & j = 1, \ldots, J-1, \\[2ex] \dfrac{1}{1 + \sum_{k=1}^{J-1} \exp(\eta_{st,k})}, & j = J. \end{cases}$$
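
The inverse of the generalized logit link can be sketched for a single case. The function name `category_probabilities` is hypothetical; the subtraction of the largest value before exponentiating is a numerical safeguard added here and does not change the result.

```python
import math


def category_probabilities(eta):
    """Map J-1 linear predictor values to J category probabilities.

    The last returned probability belongs to the reference category J,
    whose linear predictor is implicitly zero.
    """
    shift = max([0.0] + list(eta))  # guards against overflow in exp
    numerators = [math.exp(e - shift) for e in eta]
    reference = math.exp(-shift)
    denominator = reference + sum(numerators)
    return [n / denominator for n in numerators] + [reference / denominator]
```

For instance, with all linear predictor values equal to zero every category receives probability 1/J.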


The variance of y, conditional on the random effects is


$$\mathrm{Var}(y|\gamma) = A_\mu^{1/2} R A_\mu^{1/2}$$


where $A_\mu = \oplus_{s=1}^{S} \oplus_{t=1}^{T_s} \left( \mathrm{diag}(\pi_{st}) - \pi_{st}\pi_{st}^T \right) / \omega_{st}$, and $R = \phi I$ which means that R-side effects
are not supported for the multinomial distribution. $\phi$ is set to 1.


Estimation 


Linear mixed pseudo model 


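Similarly to “Linear mixed pseudo model”, we can obtain a weighted linear mixed model

$$v = X\beta + Z\gamma + \varepsilon$$


where $v = D^{-1}(y - \hat\pi) + g(\hat\pi) - O^*$ and error terms $\varepsilon \sim N(0, D^{-1} A_\mu^{1/2} R A_\mu^{1/2} D^{-1})$ with


$$D = \oplus_{s=1}^{S} \oplus_{t=1}^{T_s} D_{st} = \oplus_{s=1}^{S} \oplus_{t=1}^{T_s} \frac{\partial g^{-1}(\hat\eta_{st})}{\partial \eta_{st}} = \oplus_{s=1}^{S} \oplus_{t=1}^{T_s} \left( \mathrm{diag}(\hat\pi_{st}) - \hat\pi_{st}\hat\pi_{st}^T \right)$$

and

$$A_\mu = \oplus_{s=1}^{S} \oplus_{t=1}^{T_s} \left( \mathrm{diag}(\hat\pi_{st}) - \hat\pi_{st}\hat\pi_{st}^T \right) / \omega_{st}.$$


And block diagonal weight matrix is

$$W = D A_\mu^{-1} D = \oplus_{s=1}^{S} \oplus_{t=1}^{T_s} \omega_{st} D_{st}.$$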


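The Gaussian log pseudo-likelihood (PL) and restricted log pseudo-likelihood (REPL), which
are expressed as the functions of covariance parameters in $\theta$, corresponding to the linear mixed
model for v are the following:


$$\ell(\theta; v) = -\frac{1}{2}\ln|V(\theta)| - \frac{1}{2} r(\theta)^T V(\theta)^{-1} r(\theta) - \frac{N}{2}\ln(2\pi)$$


$$\ell_R(\theta; v) = -\frac{1}{2}\ln|V(\theta)| - \frac{1}{2} r(\theta)^T V(\theta)^{-1} r(\theta) - \frac{1}{2}\ln\left|X^T V(\theta)^{-1} X\right| - \frac{N - p_x}{2}\ln(2\pi)$$


where $V(\theta) = Z G(\theta) Z^T + W^{-1/2} R(\theta) W^{-1/2}$, $r(\theta) = v - X\beta$, N denotes the effective sample
size, and $p_x$ denotes the total number of non-redundant parameters for $\beta$.


The parameter $\theta$ can be estimated by linear mixed model using the objective function $-2\ell(\theta; v)$ or
$-2\ell_R(\theta; v)$; $\hat\beta$ and $\hat\gamma$ are computed as


$$\hat\beta = \left( X^T V(\hat\theta)^{-1} X \right)^{-} X^T V(\hat\theta)^{-1} v$$


$$\hat\gamma = \hat G Z^T V(\hat\theta)^{-1} \hat r.$$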


Iterative process 


The doubly iterative process for the estimation of $\theta$ is the same as that for other distributions, if we
replace $\hat\mu$ and $X\hat\beta + Z\hat\gamma + O$ with $\hat\pi$ and $X\hat\beta + Z\hat\gamma + O^*$ respectively, and set initial estimation
of $\pi$ as


$$\pi_{st,j}^{(0)} = \frac{y_{st,j} + 1/J}{2}$$


For more information, see the topic “Iterative process”.


Post-estimation statistics 


Wald confidence intervals 


The Wald confidence intervals for covariance parameter estimates are described in “Wald 
confidence intervals for covariance parameter estimates”. 


Statistics for estimates of fixed and random effects 


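Similarly to “Statistics for estimates of fixed and random effects”, the approximate covariance
matrix of $(\hat\beta - \beta, \hat\gamma - \gamma)$ is


$$\begin{bmatrix} X^T R^{*-1} X & X^T R^{*-1} Z \\ Z^T R^{*-1} X & Z^T R^{*-1} Z + G(\hat\theta)^{-1} \end{bmatrix}^{-1} = \begin{bmatrix} C_{11} & C_{21}^T \\ C_{21} & C_{22} \end{bmatrix}$$


where $R^* = \mathrm{var}(v|\gamma) = D^{-1} A_\mu^{1/2} R A_\mu^{1/2} D^{-1}$ with $D = \oplus_{s=1}^{S} \oplus_{t=1}^{T_s} \left( \mathrm{diag}(\hat\pi_{st}) - \hat\pi_{st}\hat\pi_{st}^T \right)$, and


$$C_{11} = (X^T V^{-1} X)^{-}$$
$$C_{21} = -G Z^T V^{-1} X C_{11}$$
$$C_{22} = (Z^T R^{*-1} Z + G^{-1})^{-1} - C_{21} X^T V^{-1} Z G$$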

Statistics for estimates of fixed and random effects on original scale 


If the fixed effects are transformed when constructing matrix X, then the final estimates of $\beta$,
$C_{11}$, $C_{21}$, and $C_{22}$ above are based on transformed scale, denoted as $\hat\beta^*$, $C_{11}^*$, $C_{21}^*$, and $C_{22}^*$
respectively. They would be transformed back on the original scale, denoted as $\hat\beta$, $C_{11}$, $C_{21}$
and $C_{22}$, respectively, as follows:


Estimated covariance matrix of the fixed effects parameters 


Model-based estimated covariance

$$\Sigma_m = C_{11}$$


Robust estimated covariance of the fixed effects parameters


$$\Sigma_r = \Sigma_m \left( \sum_{s=1}^{S} X_s^T V_s^{-1} \hat r_s \hat r_s^T V_s^{-1} X_s \right) \Sigma_m$$

where $\hat r_s = v_s - X_s\hat\beta$, and $v_s$ is a part of v corresponding to the sth super subject.


Standard error for estimates in fixed effects and predictions in random effects


Let $\hat\beta_{jc}$ denote a non-redundant fixed effects parameter estimate. Its standard error is the square
root of the $((j-1)p + c)$th diagonal element of $\Sigma$:


$$\hat\sigma_{\beta_{jc}} = \sqrt{\Sigma_{((j-1)p+c),((j-1)p+c)}}$$


The standard error for redundant parameter estimates is set to system missing value.


Similarly, let $\hat\gamma_i$ denote the ith random effects prediction. Its standard error is the square root
of the ith diagonal element of $C_{22}$:

$$\hat\sigma_{\gamma_i} = \sqrt{C_{22,ii}}$$


Test statistics for estimates in fixed effects and predictions in random effects


Test statistics for estimates in fixed effects and predictions in random effects are as those described
in “Statistics for estimates of fixed and random effects”.


Wald confidence intervals for estimates in fixed effects and random effects predictions


Wald confidence intervals are as those described in “Statistics for estimates of fixed and random
effects”.


Testing


Information criteria


These are as described in “Goodness of fit”.


Tests of fixed effects 


For each effect specified in the model, a type III test matrix L is constructed from
the generating matrix $H_\omega = (x^T \Omega x)^{-} x^T \Omega x$, where $x = (x_{11}, \ldots, x_{1T_1}, \ldots, x_{ST_S})^T$ and
$\Omega = \mathrm{diag}(\omega_{11}, \ldots, \omega_{1T_1}, \ldots, \omega_{S1}, \ldots, \omega_{ST_S})$. Then the test statistic is


$$F = \frac{\hat\beta^T L^{*T} \left( L^* \Sigma L^{*T} \right)^{-} L^* \hat\beta}{r_c}$$


where $r_c = \mathrm{rank}(L^* \Sigma L^{*T})$ and $L^* = I_{J-1} \otimes L$. The statistic has an approximate F distribution.
The numerator degrees of freedom is $r_c$, and the denominator degrees of freedom is $\nu$. For more
information, see the topic “Method for computing degrees of freedom”.


Scoring


PQL-type predicted values and relevant statistics 


(J - 1) × 1 predicted vector of the linear predictor

$$\hat\eta_{st} = (I_{J-1} \otimes x_{st})^T \hat\beta + (I_{J-1} \otimes z_{st})^T \hat\gamma_s + 1_{J-1} \otimes o_{st}$$

Estimated covariance matrix of the linear predictor

$$\Sigma_{\hat\eta_{st}} = (I_{J-1} \otimes x_{st})^T \Sigma (I_{J-1} \otimes x_{st}) + (I_{J-1} \otimes z_{st})^T C_{22}^{s} (I_{J-1} \otimes z_{st})$$
$$\qquad + (I_{J-1} \otimes z_{st})^T C_{21}^{s} (I_{J-1} \otimes x_{st}) + (I_{J-1} \otimes x_{st})^T \left( C_{21}^{s} \right)^T (I_{J-1} \otimes z_{st})$$


where $C_{22}^{s}$ is a diagonal block corresponding to the sth super subject, the approximate covariance
matrix of $\hat\gamma_s - \gamma_s$; $C_{21}^{s}$ is a part of $C_{21}$ corresponding to the sth super subject.


The estimated standard error of the jth element in $\hat\eta_{st}$, $\hat\eta_{st,j}$, is the square root of the jth diagonal
element of $\Sigma_{\hat\eta_{st}}$,


$$\hat\sigma_{\eta_{st,j}} = \sqrt{\Sigma_{\hat\eta_{st},jj}}$$

Predicted value of the probability for category j


$$\hat\pi_{st,j} = g^{-1}(\hat\eta_{st,j}) = \begin{cases} \dfrac{\exp(\hat\eta_{st,j})}{1 + \sum_{k=1}^{J-1} \exp(\hat\eta_{st,k})}, & j = 1, \ldots, J-1, \\[2ex] \dfrac{1}{1 + \sum_{k=1}^{J-1} \exp(\hat\eta_{st,k})}, & j = J. \end{cases}$$


Predicted category


$$c(x_{st}) = \arg\max_j \hat\pi_{st,j}$$


If there is a tie in determining the predicted category, the tie will be broken by choosing the
category with the highest $N_j = \sum_{s=1}^{S} \sum_{t=1}^{T_s} f_{st}\, y_{st,j}$. If there is still a tie, the one with the lowest
category number is chosen.
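
The tie-breaking rule can be sketched as follows. The function name `predicted_category` is hypothetical; `category_counts` stands for the weighted category frequencies used in the first tie-break, and categories are numbered from 1 as in the text.

```python
def predicted_category(probs, category_counts):
    """Return the 1-based predicted category with the documented tie-breaks."""
    best = max(probs)
    tied = [j for j, p in enumerate(probs) if p == best]
    # First tie-break: prefer the category with the highest observed frequency.
    top_count = max(category_counts[j] for j in tied)
    tied = [j for j in tied if category_counts[j] == top_count]
    # Second tie-break: the lowest category number wins.
    return min(tied) + 1
```

Exact equality is used to detect ties here for simplicity; an implementation working with floating-point probabilities might prefer a small tolerance.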


Approximate 100(1—a)% confidence intervals for the predicted probabilities 


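The covariance matrix of $\hat\pi_{st}$ can be computed as


$$\mathrm{Cov}(\hat\pi_{st}) = \nabla g^{-1}(\hat\eta_{st})^T\, \Sigma_{\hat\eta_{st}}\, \nabla g^{-1}(\hat\eta_{st})$$

where

$$\nabla g^{-1}(\hat\eta_{st}) = \begin{bmatrix} \dfrac{\partial \hat\pi_{st,1}}{\partial \eta_{st,1}} & \cdots & \dfrac{\partial \hat\pi_{st,J-1}}{\partial \eta_{st,1}} & \dfrac{\partial \hat\pi_{st,J}}{\partial \eta_{st,1}} \\ \vdots & & \vdots & \vdots \\ \dfrac{\partial \hat\pi_{st,1}}{\partial \eta_{st,J-1}} & \cdots & \dfrac{\partial \hat\pi_{st,J-1}}{\partial \eta_{st,J-1}} & \dfrac{\partial \hat\pi_{st,J}}{\partial \eta_{st,J-1}} \end{bmatrix}$$

with

$$\frac{\partial \pi_{st,j}}{\partial \eta_{st,k}} = \begin{cases} \pi_{st,j}(1 - \pi_{st,j}) & j = k \\ -\pi_{st,j}\pi_{st,k} & j \neq k \end{cases}$$


then the confidence interval is

$$\hat\pi_{st,j} \pm t_{\nu,\alpha/2}\,\hat\sigma_{\pi_{st,j}}, \quad j = 1, \ldots, J$$


where $\hat\sigma_{\pi_{st,j}}^2$ is the jth diagonal element of $\mathrm{Cov}(\hat\pi_{st})$ and the estimated variance of
$\hat\pi_{st,j}$, $j = 1, \ldots, J$.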


Ordinal multinomial distribution 


The ordinal multinomial distribution requires some extra notation and explanation. 


Notation 


The following notation is used throughout this section unless otherwise stated: 


$S$  Number of super subjects.

$T_s$  Number of cases in the sth super subject.

$y_{st}$  Ordinal categorical target for the tth case in the sth super subject. Its category values
are denoted as consecutive integers from 1 to J.

$J$  The total number of categories for target.

$\mathbf{y}_{st}$  Indicator vector of $y_{st}$, $\mathbf{y}_{st} = (y_{st,1}, \ldots, y_{st,J-1})^T$, where $y_{st,j} = 1$ if $y_{st} = j$,
otherwise $y_{st,j} = 0$. The superscript T means the transpose of a matrix or vector.

$\mathbf{y}_s$  $\mathbf{y}_s = (\mathbf{y}_{s1}^T, \ldots, \mathbf{y}_{sT_s}^T)^T$, $s = 1, \ldots, S$.

$\mathbf{y}$  $\mathbf{y} = (\mathbf{y}_1^T, \ldots, \mathbf{y}_S^T)^T$.

$\lambda_{st,j}$  Cumulative target probability for category j for the tth case in the sth super subject;
$\lambda_{st,j} = P(y_{st} \leq j)$.

$\lambda$  $\lambda = (\lambda_1^T, \ldots, \lambda_S^T)^T$, where $\lambda_s = (\lambda_{s1}^T, \ldots, \lambda_{sT_s}^T)^T$ and $\lambda_{st} = (\lambda_{st,1}, \ldots, \lambda_{st,J-1})^T$,
$s = 1, \ldots, S$ and $t = 1, \ldots, T_s$.

$\pi_{st,j}$  Probability of category j for the tth case in the sth super subject; that is,
$\pi_{st,j} = P(y_{st} = j)$ and $\pi_{st,j} = \lambda_{st,j} - \lambda_{st,j-1}$.

$\pi_{st}$  $\pi_{st} = (\pi_{st,1}, \ldots, \pi_{st,J-1})^T$.

$\pi_s$  $\pi_s = (\pi_{s1}^T, \ldots, \pi_{sT_s}^T)^T$, $s = 1, \ldots, S$.

$\pi$  $\pi = (\pi_1^T, \ldots, \pi_S^T)^T$.

$\eta_{st,j}$  Linear predictor value for category j of the tth case in the sth super subject.

$\eta_{st}$  $\eta_{st} = (\eta_{st,1}, \ldots, \eta_{st,J-1})^T$.

$\eta_s$  $\eta_s = (\eta_{s1}^T, \ldots, \eta_{sT_s}^T)^T$.

$\eta$  (n(J-1)) × 1 vector of linear predictor. $\eta = (\eta_1^T, \ldots, \eta_S^T)^T$.

$x_{st}$  p × 1 vector of predictors for the tth case in the sth super subject.

$z_{st}$  r × 1 vector of coefficients for the random effect corresponding to the tth case in the
sth super subject.

$O$  n × 1 vector of offsets, $O = (o_{11}, \ldots, o_{1T_1}, \ldots, o_{ST_S})^T$, where $o_{st}$ is the offset value of
the tth case in the sth super subject. This can’t be the target (y) or one of the predictors
(X). The offset must be continuous.

$O^*$  $O^* = O \otimes 1_q$, where $1_q$ is a length q vector of 1’s.

$\psi$  (J-1) × 1 vector of threshold parameters, $\psi = (\psi_1, \psi_2, \ldots, \psi_{J-1})^T$ and
$\psi_1 \leq \psi_2 \leq \cdots \leq \psi_{J-1}$.

$\beta$  p × 1 vector of unknown parameters.

$B$  (J-1+p) × 1 vector of all parameters, $B = (\psi^T, \beta^T)^T$.

$\omega_{st}$  Scale weight of the tth case in the sth super subject. It does not have to be integers. If
it is less than or equal to 0 or missing, the corresponding case is not used.

$\omega$  n × 1 vector of scale weight variable, $\omega = (\omega_{11}, \ldots, \omega_{1T_1}, \ldots, \omega_{S1}, \ldots, \omega_{ST_S})^T$.

$f_{st}$  Frequency weight of the tth case in the sth super subject. If it is a non-integer value, it
is treated by rounding the value to the nearest integer. If it is less than 0.5 or missing,
the corresponding cases are not used.

$f$  n × 1 vector of frequency count variable, $f = (f_{11}, \ldots, f_{1T_1}, \ldots, f_{S1}, \ldots, f_{ST_S})^T$.

$N$  Effective sample size, $N = \sum_{i=1}^{n} f_i$. If frequency count variable f is not used, N = n.

$A \otimes B$  Direct (or Kronecker) product of A and B, which is equal to
$\begin{bmatrix} a_{11}B & a_{12}B & a_{13}B \\ a_{21}B & a_{22}B & a_{23}B \\ a_{31}B & a_{32}B & a_{33}B \end{bmatrix}$.

$1_m$  m × 1 vector of 1’s; $1_m = (1, \ldots, 1)^T$.


Model



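The form of a generalized linear mixed model for an ordinal target with random effects is


$$\eta = g(\lambda) = XB + Z\gamma + O^*$$

where $\eta$ is the expanded linear predictor vector; $\lambda$ is the expanded cumulative target probability
vector; g(.) is a cumulative link function; X is the expanded design matrix for fixed effects
arranged as follows


$$X = \begin{bmatrix} X_1 \\ \vdots \\ X_S \end{bmatrix}, \qquad X_s = \begin{bmatrix} X_{s1} \\ \vdots \\ X_{sT_s} \end{bmatrix}_{T_s(J-1) \times (J-1+p)},$$


$$X_{st} = \begin{pmatrix} I_{J-1} & 1_{J-1} \otimes (-x_{st}^T) \end{pmatrix}_{(J-1) \times (J-1+p)} = \begin{bmatrix} 1 & \cdots & 0 & -x_{st}^T \\ \vdots & \ddots & \vdots & \vdots \\ 0 & \cdots & 1 & -x_{st}^T \end{bmatrix} = \begin{bmatrix} 1 & \cdots & 0 & -x_{st,1} & \cdots & -x_{st,p} \\ \vdots & \ddots & \vdots & \vdots & & \vdots \\ 0 & \cdots & 1 & -x_{st,1} & \cdots & -x_{st,p} \end{bmatrix};$$


$B = (\psi^T, \beta^T)^T = (\psi_1, \ldots, \psi_{J-1}, \beta^T)^T$; Z is the expanded design matrix for random effects
arranged as follows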


yz 0 0 
Z= : nn) 
Zs 


0 
Leola 
Q— 

: zit) -1)x 


$\gamma$ is a vector of random effects which are assumed to be normally distributed with mean 0 and
variance matrix G.


Zst = fe 


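The variance of y, conditional on the random effects, is

$$\mathrm{Var}(y \mid \gamma) = A_\mu^{1/2} R A_\mu^{1/2}$$

where $A_\mu = \bigoplus_{s=1}^{S} \bigoplus_{t=1}^{T_s} \left( \mathrm{diag}(\pi_{st}) - \pi_{st}\pi_{st}^T \right) / \omega_{st}$ and $R = \phi I$, which means that R-side effects
are not supported for the multinomial distribution. $\phi$ is set to 1.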
Estimation


Linear mixed pseudo model 


Similarly to “Linear mixed pseudo model”, we can obtain a weighted linear mixed model 


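$$v = XB + Z\gamma + \varepsilon$$

where $v = D^{-1}(y - \pi) + g(\lambda) - O^*$ and error terms $\varepsilon \sim N\left(0, D^{-1} A_\mu^{1/2} R A_\mu^{1/2} (D^{-1})^T\right)$ with

$$D = \bigoplus_{s=1}^{S} \bigoplus_{t=1}^{T_s} D_{st}, \qquad D_{st} = \begin{pmatrix} \frac{\partial \lambda_{st,1}}{\partial \eta_{st,1}} & 0 & \cdots & 0 \\ -\frac{\partial \lambda_{st,1}}{\partial \eta_{st,1}} & \frac{\partial \lambda_{st,2}}{\partial \eta_{st,2}} & \cdots & 0 \\ \vdots & \ddots & \ddots & \vdots \\ 0 & \cdots & -\frac{\partial \lambda_{st,J-2}}{\partial \eta_{st,J-2}} & \frac{\partial \lambda_{st,J-1}}{\partial \eta_{st,J-1}} \end{pmatrix}$$

and

$$A_\mu = \bigoplus_{s=1}^{S} \bigoplus_{t=1}^{T_s} \left( \mathrm{diag}(\pi_{st}) - \pi_{st}\pi_{st}^T \right) / \omega_{st}.$$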


The block diagonal weight matrix is

$$W = D^T A_\mu^{-1} D$$


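The Gaussian log pseudo-likelihood (PL) and restricted log pseudo-likelihood (REPL), which
are expressed as functions of the covariance parameters $\theta$ corresponding to the linear mixed
model for v, are the following:

$$\ell(\theta; v) = -\frac{1}{2} \ln|V(\theta)| - \frac{1}{2} r(\theta)^T V(\theta)^{-1} r(\theta) - \frac{N}{2} \ln(2\pi)$$

$$\ell_R(\theta; v) = -\frac{1}{2} \ln|V(\theta)| - \frac{1}{2} r(\theta)^T V(\theta)^{-1} r(\theta) - \frac{1}{2} \ln\left| X^T V(\theta)^{-1} X \right| - \frac{N - p_x}{2} \ln(2\pi)$$

where $V(\theta) = Z G(\theta) Z^T + W^{-1/2} R(\theta) W^{-1/2}$, $r(\theta) = v - XB$, N denotes the effective sample
size, and $p_x$ denotes the total number of non-redundant parameters for B.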


The parameter $\theta$ can be estimated by the linear mixed model using the objective function $-2\ell(\theta; v)$ or
$-2\ell_R(\theta; v)$. B and $\gamma$ are computed as

$$\hat{B} = \left( X^T V(\hat{\theta})^{-1} X \right)^{-1} X^T V(\hat{\theta})^{-1} v$$

$$\hat{\gamma} = \hat{G} Z^T V(\hat{\theta})^{-1} \hat{r}$$


Iterative process 


The doubly iterative process for the estimation of $\theta$ is the same as that for other distributions, if
we replace $\hat{\mu}$ and $X\hat{B} + Z\hat{\gamma} + O$ with $\hat{\pi}$ and $X\hat{B} + Z\hat{\gamma} + O^*$ respectively, and set the initial estimate
of $\pi$ as

$$\hat{\pi}_{st,j}^{(0)} = \frac{y_{st,j} + 1/J}{2}$$


For more information, see the topic “Iterative process”. 


Post-estimation statistics 


Wald confidence intervals 


The Wald confidence intervals for covariance parameter estimates are described in “Wald 
confidence intervals for covariance parameter estimates”. 


Statistics for estimates of fixed and random effects 


$\hat{C}$ is the approximate covariance matrix of $(\hat{B} - B, \hat{\gamma} - \gamma)$ and $R^*$ in $\hat{C}$ should be

$$R^* = \mathrm{var}(v \mid \gamma) = D^{-1} A_\mu^{1/2} R A_\mu^{1/2} (D^{-1})^T$$


Statistics for estimates of fixed and random effects on original scale 


If the fixed effects are transformed when constructing matrix X, then the final estimates of B,
denoted as $\hat{B}^*$, are transformed back to the original scale, denoted as $\hat{B}$, as follows:


Wr 

ye : 
==(8)-(,) 0) 

B 


where 


Ij-1 1j-1® (cls) 
0 s-t 


A= 
Testing


Estimated covariance matrix of the fixed effects parameters 


The estimated covariance matrix of the fixed effects parameters is described in “Statistics for
estimates of fixed and random effects”.


Standard error for estimates in fixed effects and predictions in random effects 


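Let $\hat{\psi}_j$, $j = 1, \ldots, J-1$, be threshold parameter estimates and $\hat{\beta}_i$, $i = 1, \ldots, p_x$, denote
non-redundant regression parameter estimates. Their standard errors are the square roots of the
diagonal elements of $\Sigma_m$ or $\Sigma_r$: $\hat{\sigma}_{\psi_j} = \sqrt{\sigma_{jj}}$ and $\hat{\sigma}_{\beta_i} = \sqrt{\sigma_{(J-1+i),(J-1+i)}}$, respectively, where
$\sigma_{ii}$ is the ith diagonal element of $\Sigma_m$ or $\Sigma_r$.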


Standard errors for predictions in random effects are as those described in “Statistics for estimates 
of fixed and random effects”. 


Test statistics for estimates in fixed effects and predictions in random effects 


The hypotheses $H_{0j}: \psi_j = 0$, $j = 1, \ldots, J-1$, are tested for threshold parameters using the
t statistic:

$$t = \frac{\hat{\psi}_j}{\hat{\sigma}_{\psi_j}}$$


Test statistics for estimates in fixed effects and predictions in random effects are otherwise as 
those described in “Statistics for estimates of fixed and random effects”. 


Wald confidence intervals for estimates in fixed effects and random effects predictions 


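The $100(1 - \alpha)\%$ Wald confidence interval for the threshold parameter is given by

$$\left( \hat{\psi}_j - t_{\nu, \alpha/2}\, \hat{\sigma}_{\psi_j}, \; \hat{\psi}_j + t_{\nu, \alpha/2}\, \hat{\sigma}_{\psi_j} \right)$$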


Wald confidence intervals are otherwise as those described in “Statistics for estimates of fixed and 
random effects”. 


The degrees of freedom can be computed by the residual method or Satterthwaite method. For the
residual method, $\nu = N - (J - 1 + p_x)$. For the Satterthwaite method, it should be similar to that
described in “Method for computing degrees of freedom”.


Information criteria 


These are as described in “Goodness of fit”, with the following modifications. 
For REPL, the value of N is chosen to be the effective sample size minus the number of non-redundant
parameters in fixed effects, $\sum_{i=1}^{n} f_i - (J - 1 + p_x)$, where $p_x$ is the number of non-redundant
parameters in fixed effects, and d is the number of covariance parameters.

For PL, the value of N is the effective sample size, $\sum_{i=1}^{n} f_i$, and d is the number of
non-redundant parameters in fixed effects, $J - 1 + p_x$, plus the number of covariance parameters.


Tests of fixed effects 


For each effect specified in the model excluding threshold parameters, a type I or III test
matrix $L_i$ is constructed and $H_0: L_i B = 0$ is tested. Construction of matrix $L_i$ is based on
matrix $H_\omega = (X_1^T \Omega X_1)^{-} X_1^T \Omega X_1$, where $X_1 = (1, -X)$, and such that $L_i B$ is estimable.
Note that $L_i B$ is estimable if and only if $L_0 = L_0 H_\omega$, where $L_0 = (l_0, L_i(\beta))$. Construction
of $L_0$ considers a partition of the more general test matrix $L_i = (L_i(\psi), L_i(\beta))$ first, where
$L_i(\psi) = (l_1, \ldots, l_{J-1})$ consists of columns corresponding to the threshold parameters and
$L_i(\beta)$ is the part of $L_i$ corresponding to regression parameters, then replaces $L_i(\psi)$ with their
sum $l_0 = \sum_{j=1}^{J-1} l_j$ to get $L_0$.


Note that the threshold-parameter effect is not tested for both type I and III analyses and
construction of $L_i$ is the same as in GENLIN. For more information, see the topic “Default Tests
of Model Effects”. Similarly, if the fixed effects are transformed when constructing
matrix X, then $H_\omega$ should be constructed based on transformed values.


Scoring 


PQL-type predicted values and relevant statistics 


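(J-1) × 1 predicted vector of the linear predictor

$$\hat{\eta}_{st} = X_{st}\hat{B} + Z_{st}\hat{\gamma}_s + 1_{J-1} \otimes o_{st}$$

Estimated covariance matrix of the linear predictor

$$\hat{\Sigma}_{\eta_{st}} = X_{st} \hat{\Sigma} X_{st}^T + Z_{st} \hat{C}_{22}^{s} Z_{st}^T + Z_{st} \hat{C}_{21}^{s} X_{st}^T + X_{st} (\hat{C}_{21}^{s})^T Z_{st}^T$$

where $\hat{C}_{22}^{s}$ is a diagonal block corresponding to the sth super subject, the approximate covariance
matrix of $\hat{\gamma}_s - \gamma_s$; $\hat{C}_{21}^{s}$ is a part of $\hat{C}_{21}$ corresponding to the sth super subject.

The estimated standard error of the jth element in $\hat{\eta}_{st}$, $\hat{\eta}_{st,j}$, is the square root of the jth diagonal
element of $\hat{\Sigma}_{\eta_{st}}$:

$$\hat{\sigma}_{\eta_{st,j}} = \sqrt{\hat{\sigma}_{\eta_{st},jj}}$$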
Predicted value of the cumulative probability for category j

$$\hat{\lambda}_{st,j} = g^{-1}(\hat{\eta}_{st,j}), \quad j = 1, \ldots, J-1$$

with $\hat{\lambda}_{st,J} = 1$.

Predicted category

$$c(x_{st}) = \arg\max_j \hat{\pi}_{st,j},$$

where $\hat{\pi}_{st,j} = \hat{\lambda}_{st,j} - \hat{\lambda}_{st,j-1}$.


If there is a tie in determining the predicted category, the tie will be broken by choosing the
category with the highest $N_j = \sum_{s=1}^{S} \sum_{t=1}^{T_s} f_{st}\, y_{st,j}$. If there is still a tie, the one with the lowest
category number is chosen.
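The prediction and tie-breaking rule above can be sketched as follows. This is a minimal illustration, not the procedure's actual code: the function name `predicted_category`, its arguments, and the use of exact floating-point equality to detect ties are all assumptions made for the example.

```python
def predicted_category(cum_probs, category_counts):
    """Return the 1-based predicted category for one case.

    cum_probs: predicted cumulative probabilities for categories 1..J-1
               (the value for category J is taken to be 1).
    category_counts: frequency-weighted counts N_j for j = 1..J,
               used only to break ties (hypothetical input).
    """
    cum = list(cum_probs) + [1.0]
    # Category probabilities are successive differences of cumulative ones.
    probs = [cum[0]] + [cum[j] - cum[j - 1] for j in range(1, len(cum))]
    best = max(probs)
    tied = [j for j, p in enumerate(probs) if p == best]
    # First tie-break: largest weighted count; second: lowest category number.
    tied.sort(key=lambda j: (-category_counts[j], j))
    return tied[0] + 1
```

For example, with cumulative probabilities 0.25, 0.5, 0.75 all four categories have probability 0.25, so the result depends only on the counts and then on the category number.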


Approximate $100(1 - \alpha)\%$ confidence intervals for the cumulative predicted probabilities

$$g^{-1}\left( \hat{\eta}_{st,j} \pm t_{\nu, \alpha/2}\, \hat{\sigma}_{\eta_{st,j}} \right), \quad j = 1, \ldots, J-1.$$


If either endpoint in the argument is outside the valid range for the inverse link function, the 
corresponding confidence interval endpoint is set to a system missing value. 


The degrees of freedom can be computed by the residual method or Satterthwaite method.
For the residual method, $\nu = N - (J - 1 + p_x)$. For Satterthwaite's approximation,
the L matrix is constructed by $(X_{st,j}, Z_{st,j})$, where $X_{st,j}$ and $Z_{st,j}$ are the jth rows of
$X_{st}$ and $Z_{st}$, respectively, corresponding to the jth category. For example, the L matrix is
$(1, 0, \ldots, 0, -x_{st}^T, z_{st}^T)_{1 \times (J-1+p+r)}$ for the 1st category. The computation should then be
similar to that described in “Method for computing degrees of freedom”.
GENLIN Algorithms


Generalized linear models (GZLM) and generalized estimating equations (GEE) are commonly 
used analytical tools for different types of data. Generalized linear models cover not only widely
used statistical models, such as linear regression for normally distributed responses, logistic
models for binary data, and loglinear models for count data, but also many other useful statistical
models through their very general model formulation. However, the independence assumption prohibits
application of generalized linear models to correlated data. Generalized estimating equations were 
developed to extend generalized linear models to accommodate correlated longitudinal data and 
clustered data. 


Generalized Linear Models 


Generalized linear models were first introduced by Nelder and Wedderburn (1972) and later 
expanded by McCullagh and Nelder (1989). The following discussion is based on their works. 


Notation 


The following notation is used throughout this section unless otherwise stated: 


Table 46-1 

Notation 

Notation    Description

n           Number of complete cases in the dataset. It is an integer and n ≥ 1.

p           Number of parameters (including the intercept, if it exists) in the model. It is an integer
            and p ≥ 1.

$p_x$       Number of non-redundant columns in the design matrix. It is an integer and $p_x$ ≥ 1.

y           n × 1 dependent variable vector. The rows are the cases.

r           n × 1 vector of events for the binomial distribution; it usually represents the number of
            “successes.” All elements are non-negative integers.

m           n × 1 vector of trials for the binomial distribution. All elements are positive integers
            and $m_i \ge r_i$, i = 1, ..., n.

$\mu$       n × 1 vector of expectations of the dependent variable.

$\eta$      n × 1 vector of linear predictors.

X           n × p design matrix. The rows represent the cases and the columns represent the
            parameters. The ith row is $x_i^T$, where $x_i = (x_{i1}, \ldots, x_{ip})^T$, i = 1, ..., n, with $x_{i1} = 1$ if the model has an
            intercept.

O           n × 1 vector of scale offsets. This variable can't be the dependent variable (y) or one of
            the predictor variables (X).

$\beta$     p × 1 vector of unknown parameters. The first element in $\beta$ is the intercept, if there is one.

$\omega$    n × 1 vector of scale weights. If an element is less than or equal to 0 or missing, the
            corresponding case is not used.

f           n × 1 vector of frequency counts. Non-integer elements are rounded
            to the nearest integer. For values less than 0.5 or missing, the corresponding cases are
            not used.

N           Effective sample size. $N = \sum_{i=1}^{n} f_i$. If frequency count variable f is not used, N = n.
Model


A GZLM of y with predictor variables X has the form

$$\eta = g(E(y)) = X\beta + O, \quad y \sim F$$

where $\eta$ is the linear predictor; O is an offset variable with a constant coefficient of 1 for each
observation; g(.) is the monotonic differentiable link function which states how the mean of
y, $E(y) = \mu$, is related to the linear predictor $\eta$; F is the response probability distribution.
Choosing different combinations of a proper probability distribution and a link function can
result in different models.


Some combinations are well known models and have been provided in different IBM® SPSS® 
Statistics procedures. The following table lists these combinations and corresponding procedures. 


Table 46-2 

Distribution, link function, and corresponding procedure 

Distribution Link function Model Procedure 

Normal Identity Linear regression GLM, REGRESSION 
Binomial Logit Logistic regression LOGISTIC REGRESSION 
Poisson Log Loglinear GENLOG 


In addition, GZLM also assumes $y_i$ are independent for i = 1, ..., n. This is the main assumption
which separates GZLM and GEE. Then for each observation, the model becomes

$$\eta_i = g(\mu_i) = x_i^T\beta + o_i, \quad y_i \sim F$$


Notes 


■ X can be any combination of scale variables (covariates), categorical variables (factors),
and interactions. The parameterization of X is the same as in the GLM procedure. Due to
use of the over-parameterized model where there is a separate parameter for every factor
effect level occurring in the data, the columns of the design matrix X are often dependent.
Collinearity between scale variables in the data can also occur. To establish the dependencies
in the design matrix, columns of $X^T \Psi X$, where $\Psi = \mathrm{diag}(f_1\omega_1, \ldots, f_n\omega_n)$, are examined by
using the sweep operator. When a column is found to be dependent on previous columns,
the corresponding parameter is treated as redundant. The solution for redundant parameters
is fixed at zero.


■ When y is a binary dependent variable which can be character or numeric, such as
“male”/“female” or 1/2, its values will be transformed to 0 and 1 with 1 typically representing
a success or some other positive result. In this document, we assume that y has been transformed to
0/1 values and we always model the probability of success; that is, Prob(y = 1). Which
original value should be transformed to 0 or 1 depends on what the reference category is. If
the reference category is the last value (REFERENCE=LAST in the syntax), then the first
category represents a success and we are modeling the probability of it. For example, if
REFERENCE=LAST is used in the syntax, “male” in “male”/“female” and 2 in 1/2 are the last
values (since “male” comes later in the dictionary than “female”) and would be transformed
to 0, and “female” and 1 would be transformed to 1 as we model the probability of them,
respectively. To model the probability of “male” and 2 instead,
specify REFERENCE=FIRST in the syntax. Note that if the original binary format is 0/1 and
REFERENCE=LAST is specified, then 0 would be transformed to 1 and 1 to 0.


■ When r, representing the number of successes (or number of 1s) and m, representing
the number of trials, are used for the binomial distribution, the response is the binomial
proportion y = r/m.
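The 0/1 recoding of a binary response described in the notes above can be illustrated with a short sketch. The function name `recode_binary` is hypothetical, and Python's default sort order is assumed to stand in for the ordering the procedure applies to the two original values.

```python
def recode_binary(values, reference="last"):
    """Recode a two-valued response to 0/1; the reference category becomes 0."""
    levels = sorted(set(values))
    if len(levels) != 2:
        raise ValueError("a binary response must have exactly two distinct values")
    # REFERENCE=LAST treats the last (largest) value as the reference category.
    ref = levels[-1] if reference == "last" else levels[0]
    return [0 if v == ref else 1 for v in values]
```

With the default `reference="last"`, `["male", "female", "male"]` becomes `[0, 1, 0]`, so the modeled probability is that of "female"; an original 0/1 coding is flipped, as the note points out.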


Multinomial Distribution

The response variable y is assumed to be ordinal; its values have an intrinsic ordering and
correspond to consecutive integers from 1 to J. The design matrix X includes model predictors,
but not an intercept. The following new notations are needed to define the model form:


Table 46-3

Notation

Notation        Description

J               The number of values for the ordinal response variable, J ≥ 1.

$\psi$          (J-1) × 1 vector of threshold parameters, $\psi = (\psi_1, \psi_2, \ldots, \psi_{J-1})^T$, with
                $\psi_1 \le \psi_2 \le \cdots \le \psi_{J-1}$.

$\beta$         p × 1 vector of regression parameters associated with model predictors,
                $\beta = (\beta_1, \beta_2, \ldots, \beta_p)^T$.

B               (J-1+p) × 1 vector of all parameters, $B = (\psi^T, \beta^T)^T$.

$\gamma_{i,j}$  Conditional cumulative response probability for category j given the observed independent
                variable vector, $\gamma_{i,j} = P(y_i \le j \mid x_i)$.

$\pi_{i,j}$     Conditional response probability for category j given the observed independent variable
                vector, $\pi_{i,j} = P(y_i = j \mid x_i)$ and $\pi_{i,j} = \gamma_{i,j} - \gamma_{i,j-1}$ for j = 1, ..., J.

$\eta_{i,j}$    Linear predictor value of case i for category j. It is related to $\gamma_{i,j}$ through a cumulative
                link function.

$$\eta_{i,j} = g(\gamma_{i,j}) = \psi_j - x_i^T\beta + o_i, \quad y_i \sim F.$$
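As an illustration of how the thresholds, linear predictor, cumulative probabilities, and category probabilities in this table fit together, the sketch below computes the category probabilities for one case under a cumulative logit link. The function name and argument layout are assumptions made for the example; only the relations $\eta_{i,j} = \psi_j - x_i^T\beta + o_i$ and $\pi_{i,j} = \gamma_{i,j} - \gamma_{i,j-1}$ come from the text.

```python
import math


def ordinal_probabilities(thresholds, x, beta, offset=0.0):
    """Category probabilities pi_1..pi_J for one case (cumulative logit link)."""
    xb = sum(xi * bi for xi, bi in zip(x, beta))
    cum = [0.0]  # gamma_0 = 0
    for psi in thresholds:
        eta = psi - xb + offset  # eta_j = psi_j - x'beta + o
        cum.append(1.0 / (1.0 + math.exp(-eta)))  # inverse cumulative logit
    cum.append(1.0)  # gamma_J = 1
    return [cum[j] - cum[j - 1] for j in range(1, len(cum))]
```

Because the thresholds are non-decreasing, the cumulative probabilities are non-decreasing too, so every category probability is non-negative and they sum to 1.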


Probability Distribution 


GZLMs are usually formulated within the framework of the exponential family of distributions.
The probability density function of the response Y for the exponential family can be presented as

$$f(y) = \exp\left\{ \frac{y\theta - b(\theta)}{\phi/\omega} + c(y, \phi/\omega) \right\}$$

where $\theta$ is the canonical (natural) parameter, $\phi$ is the scale parameter related to the variance of y
and $\omega$ is a known prior weight which varies from case to case. Different forms of $b(\theta)$ and
$c(y, \phi/\omega)$ will give specific distributions. In fact, the exponential family provides a notation that allows
us to model both continuous and discrete (count, binary, and proportional) outcomes. Several are
available including continuous ones: normal, inverse Gaussian, gamma; discrete ones: negative
binomial, Poisson, binomial, ordinal multinomial; and a mixed distribution: Tweedie.


The mean and variance of y can be expressed as follows

$$E(y) = b'(\theta) = \mu$$

$$\mathrm{Var}(y) = b''(\theta)\, \frac{\phi}{\omega} = V(\mu)\, \frac{\phi}{\omega}$$

where $b'(\theta)$ and $b''(\theta)$ denote the first and second derivatives of b with respect to $\theta$, respectively;
$V(\mu)$ is the variance function, which is a function of $\mu$.


In GZLM, the distribution of y is parameterized in terms of the mean ($\mu$) and a scale parameter
($\phi$) instead of the canonical parameter ($\theta$). The following table lists the distribution of y,
corresponding range of y, variance function ($V(\mu)$), the variance of y (Var(y)), and the first
derivative of the variance function ($V'(\mu)$), which will be used later.


Table 46-4 
Distribution, range and variance of the response, variance function, and its first derivative 


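Distribution | Range of y | $V(\mu)$ | Var(y) | $V'(\mu)$
Normal | $(-\infty, \infty)$ | 1 | $\phi$ | 0
Inverse Gaussian | $(0, \infty)$ | $\mu^3$ | $\phi\mu^3$ | $3\mu^2$
Gamma | $(0, \infty)$ | $\mu^2$ | $\phi\mu^2$ | $2\mu$
Negative binomial | $0(1)\infty$ | $\mu + k\mu^2$ | $\mu + k\mu^2$ | $1 + 2k\mu$
Poisson | $0(1)\infty$ | $\mu$ | $\mu$ | 1
Binomial(m) | $0(1)m/m$ | $\mu(1-\mu)$ | $\mu(1-\mu)/m$ | $1 - 2\mu$
Tweedie(q) | $[0, \infty)$ | $\mu^q$ | $\phi\mu^q$ | $q\mu^{q-1}$
Multinomial | $1(1)J$ | There are no simple forms for ordinal multinomial, but they are not needed for parameter estimation.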


Notes

■ 0(1)z means the range is from 0 to z with increments of 1; that is, 0, 1, 2, ..., z.

■ For the binomial distribution, the binomial trial variable m is considered as a part of the
weight variable $\omega$.

■ If a weight variable is presented, $\phi$ is replaced by $\phi/\omega$.

■ For the negative binomial distribution, the ancillary parameter (k) can be user-specified or
estimated by the maximum likelihood (ML) method. When k = 0, the negative binomial
distribution reduces to the Poisson distribution. When k = 1, the negative binomial is the
geometric distribution.

■ The Tweedie class of distributions includes discrete, continuous and mixed densities as long
as q ≤ 0 or q ≥ 1, where q is the exponent in the variance function. Special cases include the
normal (q = 0), Poisson (q = 1), gamma (q = 2) and inverse Gaussian (q = 3). Except for these
special cases, the Tweedie distributions cannot be written in closed form. Here, we only
consider the Tweedie distributions for 1 < q < 2, which can be represented as Poisson mixtures
of gamma distributions and are mixed distributions with mass at zero and with support on
the non-negative real values. These distributions are sometimes called “compound Poisson”,
“compound gamma” and “Poisson-gamma” distributions. q must be user-specified.
Scale parameter handling. The expressions for $V(\mu)$ and Var(y) for continuous distributions
and Tweedie distributions include the scale parameter $\phi$, which can be used to scale the
relationship of the variance and mean (Var(y) and $\mu$). Since it is usually unknown, there are
three ways to fit the scale parameter:

■ It can be estimated with $\beta$ jointly by the maximum likelihood method.

■ It can be set to a fixed positive value.

■ It can be specified by the deviance or Pearson chi-square. For more information, see the topic
“Goodness-of-Fit Statistics”.


On the other hand, discrete distributions do not have this extra parameter (it is theoretically equal
to one). Because of this, the variance of y might not be equal to the nominal variance in practice
(especially for Poisson and binomial, because the negative binomial has an ancillary parameter k).
A simple way to adjust for this situation is to allow the variance of y for discrete distributions to have
the scale parameter as well, but unlike continuous distributions, it can't be estimated by the ML
method. So for discrete distributions, there are two ways to obtain the value of $\phi$:

■ It can be specified by the deviance or Pearson chi-square.

■ It can be set to a fixed positive value.


To ensure the data fit the range of response for the specified distribution, we follow the rules:

■ For the gamma or inverse Gaussian distributions, values of y must be real and greater than
zero. If a value of y is less than or equal to 0 or missing, the corresponding case is not used.

■ For the negative binomial and Poisson distributions, values of y must be integer and
non-negative. If a value of y is non-integer, less than 0 or missing, the corresponding case is
not used.

■ For the binomial distribution, if the response is in the form of a single variable, y must
have only two distinct values. If y has more than two distinct values, the algorithm terminates
in an error.

■ For the binomial distribution, if the response is in the form of a ratio of two variables denoted
events/trials, values of r (the number of events) must be non-negative integers, values of m
(the number of trials) must be positive integers and $m_i \ge r_i$ for all i. If a value of r is not integer,
less than 0, or missing, the corresponding case is not used. If a value of m is not integer, less
than or equal to 0, less than the corresponding value of r, or missing, the corresponding
case is not used.

■ For the Tweedie distributions, values of y must be zero or positive real. If a value of y is less
than 0 or missing, the corresponding case is not used.
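The case-level range rules above can be sketched as a small filter. The function name, the distribution labels, and the use of `None` for a missing value are assumptions made for this illustration; the dataset-level check for a single-variable binomial response (exactly two distinct values) is not covered.

```python
def _is_integer(value):
    """True when a numeric value has no fractional part."""
    return float(value).is_integer()


def case_is_usable(dist, y=None, r=None, m=None):
    """Return True if a case satisfies the range rule for `dist`.

    None stands for a missing value (an assumption of this sketch).
    """
    if dist == "binomial_events_trials":
        if r is None or m is None:
            return False
        return _is_integer(r) and r >= 0 and _is_integer(m) and m > 0 and m >= r
    if y is None:
        return False
    if dist in ("gamma", "inverse_gaussian"):
        return y > 0
    if dist in ("negative_binomial", "poisson"):
        return _is_integer(y) and y >= 0
    if dist == "tweedie":
        return y >= 0
    raise ValueError("no range rule sketched for distribution: " + dist)
```

Cases for which the function returns False would simply be dropped before estimation, which matches the "corresponding case is not used" wording of the rules.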


The ML method will be used to estimate $\beta$ and possibly $\phi$ for continuous distributions and the
Tweedie distribution, or k for the negative binomial. The kernels of the log-likelihood function
($\ell_k$) and the full log-likelihood function ($\ell$), which will be used as the objective function for
parameter estimation, are listed for each distribution in the following table. Using $\ell$ or $\ell_k$ won't
affect the parameter estimation, but the selection will affect the calculation of information criteria.
For more information, see the topic “Goodness-of-Fit Statistics”.
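Table 46-5
The log-likelihood function for probability distribution

Distribution | $\ell_k$ and $\ell$
Normal | $\ell_k = -\sum_{i=1}^{n} \frac{f_i}{2} \left\{ \frac{\omega_i (y_i - \mu_i)^2}{\phi} + \ln\left(\frac{\phi}{\omega_i}\right) \right\}$; $\ell = \ell_k - \sum_{i=1}^{n} \frac{f_i}{2} \{\ln(2\pi)\}$
Inverse Gaussian | $\ell_k = -\sum_{i=1}^{n} \frac{f_i}{2} \left\{ \frac{\omega_i (y_i - \mu_i)^2}{\phi y_i \mu_i^2} + \ln\left(\frac{\phi y_i^3}{\omega_i}\right) \right\}$; $\ell = \ell_k - \sum_{i=1}^{n} \frac{f_i}{2} \{\ln(2\pi)\}$
Gamma | $\ell_k = \sum_{i=1}^{n} f_i \left\{ \frac{\omega_i}{\phi} \ln\left(\frac{\omega_i y_i}{\phi \mu_i}\right) - \frac{\omega_i y_i}{\phi \mu_i} - \ln\left(\Gamma\left(\frac{\omega_i}{\phi}\right)\right) \right\}$; $\ell = \ell_k + \sum_{i=1}^{n} f_i \{-\ln(y_i)\}$
Negative binomial | $\ell_k = \sum_{i=1}^{n} \frac{f_i \omega_i}{\phi} \left\{ y_i \ln(k\mu_i) - (y_i + 1/k) \ln(1 + k\mu_i) + \ln\left(\frac{\Gamma(y_i + 1/k)}{\Gamma(1/k)}\right) \right\}$; $\ell = \ell_k + \sum_{i=1}^{n} \frac{f_i \omega_i}{\phi} \{-\ln(\Gamma(y_i + 1))\}$
Poisson | $\ell_k = \sum_{i=1}^{n} \frac{f_i \omega_i}{\phi} \{ y_i \ln(\mu_i) - \mu_i \}$; $\ell = \ell_k + \sum_{i=1}^{n} \frac{f_i \omega_i}{\phi} \{-\ln(y_i!)\}$
Binomial(m) | $\ell_k = \sum_{i=1}^{n} \frac{f_i \omega_i}{\phi} \{ y_i \ln(\mu_i) + (1 - y_i) \ln(1 - \mu_i) \}$; $\ell = \ell_k + \sum_{i=1}^{n} \frac{f_i \omega_i}{\phi} \ln \binom{m_i}{r_i}$
Tweedie | $\ell_k = \sum_{i=1}^{n} f_i \left\{ \ln(V_i) + \frac{\omega_i}{\phi} \left( \frac{y_i \mu_i^{1-q}}{1-q} - \frac{\mu_i^{2-q}}{2-q} \right) \right\}$; $\ell = \ell_k + \sum_{i=1}^{n} f_i \{-\ln(y_i)\}$
Multinomial | $\ell_k = \sum_{i=1}^{n} \sum_{j=1}^{J} \frac{f_i \omega_i}{\phi}\, y_{i,j} \ln(\pi_{i,j})$, where $y_{i,j} = 1$ if $y_i = j$ and 0 otherwise; $\ell = \ell_k$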

When an individual y = 0 for the negative binomial, Poisson or Tweedie distributions and y = 0
or 1 for the binomial distribution, a separate value of the log-likelihood is given. Let $\ell_{k,i}$ be
the log-likelihood value for individual case i when $y_i = 0$ for the negative binomial, Poisson
and Tweedie and 0/1 for the binomial. The full log-likelihood for i is equal to the kernel of the
log-likelihood for i; that is, $\ell_i = \ell_{k,i}$.


Table 46-6

Log-likelihood

Distribution | $\ell_{k,i}$
Negative binomial | $\ell_{k,i} = -\frac{f_i \omega_i}{\phi k} \ln(1 + k\mu_i)$ if $y_i = 0$
Poisson | $\ell_{k,i} = -\frac{f_i \omega_i}{\phi} \mu_i$ if $y_i = 0$
Binomial(m) | $\ell_{k,i} = \frac{f_i \omega_i}{\phi} \ln(1 - \mu_i)$ if $y_i = 0$; $\ell_{k,i} = \frac{f_i \omega_i}{\phi} \ln(\mu_i)$ if $y_i = 1$
Tweedie | $\ell_{k,i} = -\frac{f_i \omega_i}{\phi} \frac{\mu_i^{2-q}}{2-q}$ if $y_i = 0$

■ $\Gamma(z)$ is the gamma function and $\ln(\Gamma(z))$ is the log-gamma function (the logarithm of the
gamma function), evaluated at z.

■ For the negative binomial distribution, the scale parameter is still included in $\ell_k$ for flexibility,
although it is usually set to 1.

■ For the negative binomial distribution, $y_i$ must be a non-negative integer, which
means $\Gamma(y_i + 1) = y_i!$ and $\ell = \ell_k + \sum_{i=1}^{n} \frac{f_i \omega_i}{\phi} \{-\ln(y_i!)\}$. In addition, $\ell_k$ can be
written as $\sum_{i=1}^{n} \frac{f_i \omega_i}{\phi} \left\{ y_i \ln(k\mu_i) - (y_i + 1/k) \ln(1 + k\mu_i) + \sum_{j=0}^{y_i - 1} \ln(j + 1/k) \right\}$ because
$\frac{\Gamma(y_i + 1/k)}{\Gamma(1/k)} = \prod_{j=0}^{y_i - 1} (j + 1/k)$. Some potential computational problems can be avoided by using
this form. See Cameron and Trivedi (1998, p. 72).

■ For the binomial distribution (r/m), the scale weight variable becomes $\omega_i^* = \omega_i m_i$ in $\ell_k$; that
is, the binomial trials variable m is regarded as a part of the weight. However, the scale
weight in the extra term of $\ell$ is still $\omega_i$.


■ $V_i$ in the Tweedie distribution is an infinite series as follows:

$$V_i = \sum_{j=1}^{\infty} V_{ij}$$

and

$$V_{ij} = \frac{\omega_i^{j(1+a)}\, y_i^{ja}}{\phi^{j(1+a)} (q-1)^{ja} (2-q)^{j}\, j!\, \Gamma(ja)}$$

where $a = \frac{2-q}{q-1}$. To evaluate the infinite summation for $V_i$, the value of j is determined for
which $V_{ij}$ reaches a maximum and the necessary terms of the series are summed in that region. The
method proposed by Dunn and Smyth (2005) is adopted here.
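The product identity used in the negative binomial note above, $\Gamma(y + 1/k)/\Gamma(1/k) = \prod_{j=0}^{y-1}(j + 1/k)$, can be checked numerically with a short sketch. The two function names are hypothetical; `math.lgamma` is the standard-library log-gamma function.

```python
import math


def log_gamma_ratio(y, k):
    """ln(Gamma(y + 1/k) / Gamma(1/k)) via the log-gamma function."""
    return math.lgamma(y + 1.0 / k) - math.lgamma(1.0 / k)


def log_product_form(y, k):
    """The same quantity as a finite sum of logs, for integer y >= 0."""
    return sum(math.log(j + 1.0 / k) for j in range(y))
```

The sum form avoids subtracting two large log-gamma values when 1/k is large (k close to 0), which is one plausible reading of the computational problems the note alludes to.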


Link Function 


The following tables list the form, inverse form, range of $\hat{\mu}$, and first and second derivatives
for each link function.




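Table 46-7
Link function name, form, inverse of link function, and range of the predicted mean

Link function | $\eta = g(\mu)$ | Inverse $\mu = g^{-1}(\eta)$ | Range of $\hat{\mu}$
Identity | $\mu$ | $\eta$ | $\hat{\mu} \in R$
Log | $\ln(\mu)$ | $\exp(\eta)$ | $\hat{\mu} > 0$
Logit | $\ln\left(\frac{\mu}{1-\mu}\right)$ | $\frac{\exp(\eta)}{1+\exp(\eta)}$ | $\hat{\mu} \in (0,1)$
Probit | $\Phi^{-1}(\mu)$, where $\Phi(\xi) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\xi} e^{-z^2/2}\,dz$ | $\Phi(\eta)$ | $\hat{\mu} \in [0,1]$
Complementary log-log | $\ln(-\ln(1-\mu))$ | $1 - \exp(-\exp(\eta))$ | $\hat{\mu} \in [0,1]$
Power($\alpha$), $\alpha \ne 0$ | $\mu^{\alpha}$ | $\eta^{1/\alpha}$ | $\hat{\mu} \in R$ if $\alpha$ or $1/\alpha$ is an odd integer; $\hat{\mu} > 0$ otherwise
Power($\alpha$), $\alpha = 0$ | $\ln(\mu)$ | $\exp(\eta)$ | $\hat{\mu} > 0$
Log-complement | $\ln(1-\mu)$ | $1 - \exp(\eta)$ | $\hat{\mu} < 1$
Negative log-log | $-\ln(-\ln(\mu))$ | $\exp(-\exp(-\eta))$ | $\hat{\mu} \in (0,1)$
Negative binomial | $\ln\left(\frac{\mu}{\mu + 1/k}\right)$ | $\frac{\exp(\eta)}{k(1-\exp(\eta))}$ | $\hat{\mu} > 0$
Odds power($\alpha$), $\alpha \ne 0$ | $\frac{(\mu/(1-\mu))^{\alpha} - 1}{\alpha}$ | $\frac{(1+\alpha\eta)^{1/\alpha}}{1 + (1+\alpha\eta)^{1/\alpha}}$ | $\hat{\mu} \in [0,1)$
Odds power($\alpha$), $\alpha = 0$ | $\ln\left(\frac{\mu}{1-\mu}\right)$ | $\frac{\exp(\eta)}{1+\exp(\eta)}$ | $\hat{\mu} \in [0,1]$

Note: In the power link function, if $|\alpha|$ < 2.2e-16, $\alpha$ is treated as 0.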
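Table 46-8
The first and second derivatives of link function

Link function | First derivative $g'(\mu) = \frac{\partial\eta}{\partial\mu} = \Delta$ | Second derivative $g''(\mu) = \frac{\partial^2\eta}{\partial\mu^2}$
Identity | 1 | 0
Log | $\frac{1}{\mu}$ | $-\Delta^2$
Logit | $\frac{1}{\mu(1-\mu)}$ | $\Delta^2(2\mu - 1)$
Probit | $\frac{1}{\phi(\Phi^{-1}(\mu))}$, where $\phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$ | $\Delta^2\, \Phi^{-1}(\mu)$
Complementary log-log | $\frac{1}{(\mu-1)\ln(1-\mu)}$ | $-\Delta^2 (1 + \ln(1-\mu))$
Power($\alpha$), $\alpha \ne 0$ | $\alpha\mu^{\alpha-1}$ | $\Delta\,\frac{\alpha-1}{\mu}$
Power($\alpha$), $\alpha = 0$ | $\frac{1}{\mu}$ | $-\Delta^2$
Log-complement | $\frac{-1}{1-\mu}$ | $-\Delta^2$
Negative log-log | $\frac{-1}{\mu\ln(\mu)}$ | $\Delta^2 (1 + \ln(\mu))$
Negative binomial | $\frac{1}{\mu + k\mu^2}$ | $-\Delta^2 (1 + 2k\mu)$
Odds power($\alpha$), $\alpha \ne 0$ | $\frac{\mu^{\alpha-1}}{(1-\mu)^{\alpha+1}}$ | $\Delta\left(\frac{\alpha-1}{\mu} + \frac{\alpha+1}{1-\mu}\right)$
Odds power($\alpha$), $\alpha = 0$ | $\frac{1}{\mu(1-\mu)}$ | $\Delta^2(2\mu - 1)$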
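The closed-form derivatives for a link function can be sanity-checked against central finite differences. The sketch below does this for four of the links (log, logit, complementary log-log, negative log-log) using the standard calculus expressions for $g'$ and $g''$; the dictionary layout, function names, and step size are assumptions of the example.

```python
import math

# Each entry holds (g, g', g'') as functions of mu.
LINKS = {
    "log": (
        math.log,
        lambda mu: 1.0 / mu,
        lambda mu: -(1.0 / mu) ** 2,
    ),
    "logit": (
        lambda mu: math.log(mu / (1.0 - mu)),
        lambda mu: 1.0 / (mu * (1.0 - mu)),
        lambda mu: (2.0 * mu - 1.0) / (mu * (1.0 - mu)) ** 2,
    ),
    "cloglog": (
        lambda mu: math.log(-math.log(1.0 - mu)),
        lambda mu: 1.0 / ((mu - 1.0) * math.log(1.0 - mu)),
        lambda mu: -(1.0 + math.log(1.0 - mu))
        / ((mu - 1.0) * math.log(1.0 - mu)) ** 2,
    ),
    "negative_log_log": (
        lambda mu: -math.log(-math.log(mu)),
        lambda mu: -1.0 / (mu * math.log(mu)),
        lambda mu: (1.0 + math.log(mu)) / (mu * math.log(mu)) ** 2,
    ),
}


def central_difference(func, x, h=1e-5):
    """Two-sided finite-difference approximation of func'(x)."""
    return (func(x + h) - func(x - h)) / (2.0 * h)


def max_derivative_error(mu):
    """Largest gap between closed-form and numeric derivatives at mu."""
    worst = 0.0
    for g, g1, g2 in LINKS.values():
        worst = max(worst, abs(central_difference(g, mu) - g1(mu)))
        worst = max(worst, abs(central_difference(g1, mu) - g2(mu)))
    return worst
```

A check of this kind is a cheap way to catch a sign or exponent slip when transcribing derivative formulas into code.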
Table 46-9
Cumulative Link Function Name, Form, Inverse Form and Range of the Predicted Cumulative
Probability

Link function | $\eta = g(\gamma)$ | Inverse $\gamma = g^{-1}(\eta)$ | Range of $\hat{\gamma}$
Cumulative logit | $\ln\left(\frac{\gamma}{1-\gamma}\right)$ | $\frac{\exp(\eta)}{1+\exp(\eta)}$ | $\hat{\gamma} \in [0,1]$
Cumulative probit | $\Phi^{-1}(\gamma)$, where $\Phi(\xi) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\xi} e^{-z^2/2}\,dz$ | $\Phi(\eta)$ | $\hat{\gamma} \in [0,1]$
Cumulative complementary log-log | $\ln(-\ln(1-\gamma))$ | $1 - \exp(-\exp(\eta))$ | $\hat{\gamma} \in [0,1]$
Cumulative negative log-log | $-\ln(-\ln(\gamma))$ | $\exp(-\exp(-\eta))$ | $\hat{\gamma} \in [0,1]$
Cumulative Cauchit | $\tan(\pi(\gamma - 0.5))$ | $0.5 + \arctan(\eta)/\pi$ | $\hat{\gamma} \in [0,1]$

Note: $\pi$ in the formulae is the number, not the response probability.


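Table 46-10
The Inverse First and Second Derivatives of Cumulative Link Function

Link function | Inverse first derivative $\frac{\partial\gamma}{\partial\eta} = \Delta$ | Inverse second derivative $\frac{\partial^2\gamma}{\partial\eta^2}$
Cumulative logit | $\gamma(1-\gamma)$ | $\Delta(1 - 2\gamma)$
Cumulative probit | $\phi(\Phi^{-1}(\gamma))$, where $\phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$ | $-\Delta \times \Phi^{-1}(\gamma)$
Cumulative complementary log-log | $(\gamma - 1)\ln(1-\gamma)$ | $\Delta(1 + \ln(1-\gamma))$
Cumulative negative log-log | $-\gamma\ln(\gamma)$ | $-\Delta(1 + \ln(\gamma))$
Cumulative Cauchit | $\cos^2(\pi(\gamma - 0.5))/\pi$ | $\Delta \times \sin(2\pi\gamma)$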

When the canonical parameter is equal to the linear predictor, $\theta = \eta$, then the link function is
called the canonical link function. Although the canonical links lead to desirable statistical 
properties of the model, particularly in small samples, there is in general no a priori reason why 
the systematic effects in a model should be additive on the scale given by that link. The canonical 
link functions for probability distributions are given in the following table. 


Table 46-11 

Canonical and default link functions for probability distributions 
Distribution Canonical link function 
Normal Identity 

Inverse Gaussian Power(—2) 
Gamma Power(—1) 
Negative binomial Negative binomial 
Poisson Log 

Binomial Logit 

Tweedie Power(1—q) 
Multinomial Cumulative logit 
Estimation


Having selected a particular model, it is required to estimate the parameters and to assess the 
precision of the estimates. 


Parameter estimation 


The parameters are estimated by maximizing the log-likelihood function (or the kernel of the
log-likelihood function) from the observed data. Let s be the first derivative (gradient) vector of
the log-likelihood with respect to each parameter, then we wish to solve

$$s = \left[ \frac{\partial \ell}{\partial \beta} \right]_{p \times 1} = 0$$

or, for the multinomial distribution,

$$s = \left[ \frac{\partial \ell}{\partial B} \right]_{(J-1+p) \times 1} = \begin{bmatrix} \partial \ell / \partial \psi \\ \partial \ell / \partial \beta \end{bmatrix} = 0$$

In general, there is no closed form solution except for a normal distribution with identity link
function, so estimates are obtained numerically via an iterative process. A Newton-Raphson
and/or Fisher scoring algorithm is used and it is based on a linear Taylor series approximation
of the first derivative of the log-likelihood.


First Derivatives 


If the scale parameter $\phi$ is not estimated by the ML method, s is a p × 1 vector with the form:

$$s = \frac{\partial \ell}{\partial \beta} = \sum_{i=1}^{n} \frac{f_i \omega_i (y_i - \mu_i)}{\phi\, V(\mu_i)\, g'(\mu_i)}\, x_i = \frac{1}{\phi} \sum_{i=1}^{n} \frac{f_i \omega_i (y_i - \mu_i)}{V(\mu_i)\, g'(\mu_i)}\, x_i$$

where $\mu_i$, $V(\mu_i)$ and $g'(\mu_i)$ are defined in Table 46-7 “Link function name, form, inverse of link
function, and range of the predicted mean”, Table 46-4 “Distribution, range and variance of the
response, variance function, and its first derivative”, and Table 46-8 “The first and second
derivatives of link function”, respectively.


If the scale parameter $\phi$ is estimated by the ML method, it is handled by searching for $\ln(\phi)$ since
$\phi$ is required to be greater than zero. Similarly, if the ancillary parameter k for negative binomial
is estimated by the ML method, it is still handled by searching for ln(k) since k is also required to
be greater than zero.


Let $\tau = \ln(\phi)$ so $\phi = \exp(\tau)$ (or $\tau = \ln(k)$ and $k = \exp(\tau)$ for negative binomial), then s is a (p+1) × 1
vector with the following form
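$$s = \begin{bmatrix} \partial \ell / \partial \beta \\ \partial \ell / \partial \tau \end{bmatrix}_{(p+1) \times 1} = \begin{bmatrix} \sum_{i=1}^{n} \frac{f_i \omega_i (y_i - \mu_i)}{\exp(\tau)\, V(\mu_i)\, g'(\mu_i)}\, x_i \\ \partial \ell / \partial \tau \end{bmatrix}$$

where $\partial \ell / \partial \beta$ is the same as the above with $\phi$ replaced with $\exp(\tau)$ (though for negative
binomial, $\phi$ is not replaced), and $\partial \ell / \partial \tau$ has a different form depending on the distribution as follows:

Table 46-12
The 1st derivative functions w.r.t. the scale parameter for probability distributions

Distribution | $\frac{\partial \ell}{\partial \tau}$
Normal | $\sum_{i=1}^{n} \frac{f_i}{2} \left\{ \frac{\omega_i (y_i - \mu_i)^2}{\exp(\tau)} - 1 \right\}$
Inverse Gaussian | $\sum_{i=1}^{n} \frac{f_i}{2} \left\{ \frac{\omega_i (y_i - \mu_i)^2}{\exp(\tau)\, y_i \mu_i^2} - 1 \right\}$
Gamma | $-\sum_{i=1}^{n} \frac{f_i \omega_i}{\exp(\tau)} \left\{ \ln\left(\frac{\omega_i y_i}{\exp(\tau)\mu_i}\right) - \frac{y_i}{\mu_i} + 1 - \psi\left(\frac{\omega_i}{\exp(\tau)}\right) \right\}$
Negative binomial | $\sum_{i=1}^{n} \frac{f_i \omega_i}{\phi \exp(\tau)} \left\{ a_i + \ln(1 + \exp(\tau)\mu_i) - \psi\left(y_i + \frac{1}{\exp(\tau)}\right) + \psi\left(\frac{1}{\exp(\tau)}\right) \right\}$, where for all appropriate link functions other than the negative binomial link function, $a_i = \frac{\exp(\tau)(y_i - \mu_i)}{1 + \exp(\tau)\mu_i}$, and for the negative binomial link function, $a_i = 0$
Tweedie | $\sum_{i=1}^{n} \frac{\partial \ell_i}{\partial \tau}$, where $\frac{\partial \ell_i}{\partial \tau} = \frac{f_i \omega_i}{\exp(\tau)} \frac{\mu_i^{2-q}}{2-q}$ if $y_i = 0$, and $\frac{\partial \ell_i}{\partial \tau} = f_i \left\{ \frac{1}{V_i} \frac{\partial V_i}{\partial \tau} - \frac{\omega_i}{\exp(\tau)} \left( \frac{y_i \mu_i^{1-q}}{1-q} - \frac{\mu_i^{2-q}}{2-q} \right) \right\}$ if $y_i > 0$

Note: $\psi(z)$ is the digamma function, which is the derivative of the logarithm of the gamma function,
evaluated at z; that is, $\psi(z) = \frac{\partial \ln(\Gamma(z))}{\partial z} = \frac{\Gamma'(z)}{\Gamma(z)}$.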
As mentioned above, for the normal distribution with identity link function, which is a classical linear
regression model, there is a closed form solution for both $\beta$ and $\tau$, so no iterative process is
needed. The solution for $\beta$, after applying the SWEEP operation in the GLM procedure, is

$$\hat{\beta} = \left( \sum_{i=1}^{n} f_i \omega_i x_i x_i^T \right)^{-} \left( \sum_{i=1}^{n} f_i \omega_i x_i (y_i - o_i) \right) = \left( X^T \Psi X \right)^{-} X^T \Psi (y - O),$$

where $\Psi = \mathrm{diag}(f_1\omega_1, \ldots, f_n\omega_n)$ and $Z^{-}$ is the generalized inverse of a matrix Z. If the scale
parameter $\phi$ is also estimated by the ML method, the estimate of $\tau$ is

$$\hat{\tau} = \ln(\hat{\phi}) = \ln\left( \frac{1}{N} \sum_{i=1}^{n} f_i \omega_i \left( y_i - x_i^T\hat{\beta} - o_i \right)^2 \right).$$
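For the normal distribution with identity link, the closed-form estimates can be illustrated for the simplest case of an intercept and a single covariate. The sketch below assumes a full-rank design (so an ordinary inverse replaces the generalized inverse), takes the residual as $y_i - \hat{\mu}_i$ with the offset included in $\hat{\mu}_i$, and uses a hypothetical function name and argument layout.

```python
import math


def normal_identity_ml(x, y, f=None, w=None, offset=None):
    """ML estimates (b0, b1, tau) for y ~ Normal, mean b0 + b1*x + offset.

    f: frequency weights, w: scale weights (both default to 1).
    tau is the log of the ML scale estimate.
    """
    n = len(y)
    f = f or [1.0] * n
    w = w or [1.0] * n
    o = offset or [0.0] * n
    fw = [fi * wi for fi, wi in zip(f, w)]
    z = [yi - oi for yi, oi in zip(y, o)]  # response net of the offset
    s = sum(fw)
    sx = sum(a * xi for a, xi in zip(fw, x))
    sxx = sum(a * xi * xi for a, xi in zip(fw, x))
    sz = sum(a * zi for a, zi in zip(fw, z))
    sxz = sum(a * xi * zi for a, xi, zi in zip(fw, x, z))
    det = s * sxx - sx * sx  # nonzero for a full-rank design
    b0 = (sxx * sz - sx * sxz) / det
    b1 = (s * sxz - sx * sz) / det
    rss = sum(a * (zi - b0 - b1 * xi) ** 2 for a, xi, zi in zip(fw, x, z))
    tau = math.log(rss / sum(f))  # divide by the effective sample size N
    return b0, b1, tau
```

Note that the scale estimate divides by N rather than by N minus the number of parameters, which is what setting the derivative with respect to the log scale to zero gives.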


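For the ordinal multinomial model:

$$s = \left[ \frac{\partial \ell}{\partial \psi_1}, \ldots, \frac{\partial \ell}{\partial \psi_{J-1}}, \frac{\partial \ell}{\partial \beta_1}, \ldots, \frac{\partial \ell}{\partial \beta_p} \right]^T$$

$$\frac{\partial \ell}{\partial \psi_j} = \sum_{i=1}^{n} \frac{f_i \omega_i}{\phi} \frac{\partial \gamma_{i,j}}{\partial \eta_{i,j}} \left( \frac{y_{i,j}}{\pi_{i,j}} - \frac{y_{i,j+1}}{\pi_{i,j+1}} \right), \quad j = 1, \ldots, J-1,$$

$$\frac{\partial \ell}{\partial \beta_t} = -\sum_{i=1}^{n} \frac{f_i \omega_i}{\phi} \sum_{j=1}^{J} \left( \frac{\partial \gamma_{i,j}}{\partial \eta_{i,j}} - \frac{\partial \gamma_{i,j-1}}{\partial \eta_{i,j-1}} \right) \frac{y_{i,j}}{\pi_{i,j}}\, x_{it}, \quad t = 1, \ldots, p,$$

and

$$\pi_{i,j} = \gamma_{i,j} - \gamma_{i,j-1} \quad \text{for } j = 1, 2, \ldots, J$$

$$\gamma_{i,j} = \begin{cases} 0 & j = 0 \\ g^{-1}(\psi_j - x_i^T\beta + o_i) & j = 1, \ldots, J-1 \\ 1 & j = J \end{cases}$$

Note: if $\gamma_{i,j} = 0$ or $\gamma_{i,j} = 1$ then $\frac{\partial \gamma_{i,j}}{\partial \eta_{i,j}} = 0$ for all cumulative link functions.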


Second Derivatives 


Let H be the second derivative (Hessian) matrix. If the scale parameter is not estimated by the ML method, H is a p×p matrix with the following form

$$H=\left[\frac{\partial^2\ell}{\partial\beta\,\partial\beta^T}\right]_{p\times p}=-X^T W X$$

where W is an n×n diagonal matrix. There are two definitions for W depending on which algorithm is used: $W_e$ for Fisher scoring and $W_o$ for Newton-Raphson. The ith diagonal element for $W_e$ is

$$w_{e,i}=\frac{f_i w_i}{\phi}\frac{1}{V(\mu_i)\left(g'(\mu_i)\right)^2}$$

and the ith diagonal element for $W_o$ is

$$w_{o,i}=w_{e,i}+\frac{f_i w_i}{\phi}(y_i-\mu_i)\frac{V(\mu_i)g''(\mu_i)+V'(\mu_i)g'(\mu_i)}{\left(V(\mu_i)\right)^2\left(g'(\mu_i)\right)^3}$$

where $V'(\mu_i)$ and $g''(\mu_i)$ are defined in Table 46-4 “Distribution, range and variance of the response, variance function, and its first derivative” and Table 46-8 “The first and second derivatives of link function”, respectively. Note that the expected value of $W_o$ is $W_e$, and when the canonical link is used for the specified distribution, $W_o=W_e$.


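If the scale parameter is estimated by the ML method, H becomes a (p+1)×(p+1) matrix with the form

$$H=\begin{bmatrix}\dfrac{\partial^2\ell}{\partial\beta\,\partial\beta^T} & \dfrac{\partial^2\ell}{\partial\beta\,\partial\tau}\\[2ex]\dfrac{\partial^2\ell}{\partial\tau\,\partial\beta^T} & \dfrac{\partial^2\ell}{\partial\tau^2}\end{bmatrix}_{(p+1)\times(p+1)}$$

where $\partial^2\ell/\partial\beta\,\partial\tau$ is a p×1 vector and $\partial^2\ell/\partial\tau\,\partial\beta^T$ is a 1×p vector and the transpose of $\partial^2\ell/\partial\beta\,\partial\tau$.
For all three continuous distributions:

$$\frac{\partial^2\ell}{\partial\beta\,\partial\tau}=-\sum_{i=1}^{n}\frac{f_i w_i(y_i-\mu_i)}{\exp(\tau)V(\mu_i)g'(\mu_i)}x_i=-\frac{\partial\ell}{\partial\beta}$$

The forms of $\partial^2\ell/\partial\beta\,\partial\tau$ for negative binomial are as follows, depending on the link function:

For all appropriate link functions other than the negative binomial link function,

$$\frac{\partial^2\ell}{\partial\beta\,\partial\tau}=-\sum_{i=1}^{n}\frac{f_i w_i\exp(\tau)(y_i-\mu_i)}{\phi\left(1+\exp(\tau)\mu_i\right)^2 g'(\mu_i)}x_i$$

for the negative binomial link function,

$$\frac{\partial^2\ell}{\partial\beta\,\partial\tau}=\sum_{i=1}^{n}\frac{f_i w_i}{\phi}\mu_i x_i$$

The forms of $\partial^2\ell/\partial\tau^2$ are listed in the following table.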


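Table 46-13
The second derivative functions w.r.t. the scale parameter for probability distributions

Distribution: $\partial^2\ell/\partial\tau^2$

Normal:
$$-\sum_{i=1}^{n}\frac{f_i w_i(y_i-\mu_i)^2}{2\exp(\tau)}$$

Inverse Gaussian:
$$-\sum_{i=1}^{n}\frac{f_i w_i(y_i-\mu_i)^2}{2\exp(\tau)\,y_i\mu_i^2}$$

Gamma:
$$\sum_{i=1}^{n}\frac{f_i w_i}{\exp(\tau)}\left[\ln\left(\frac{w_i y_i}{\exp(\tau)\mu_i}\right)+2-\frac{y_i}{\mu_i}-\psi\left(\frac{w_i}{\exp(\tau)}\right)-\frac{w_i}{\exp(\tau)}\psi'\left(\frac{w_i}{\exp(\tau)}\right)\right]$$

Negative Binomial:
$$\sum_{i=1}^{n}\frac{f_i w_i}{\phi}\left\{a_i-\frac{1}{\exp(\tau)}\ln\left(1+\exp(\tau)\mu_i\right)+\frac{1}{\exp(\tau)}\left[\psi\left(y_i+\frac{1}{\exp(\tau)}\right)-\psi\left(\frac{1}{\exp(\tau)}\right)\right]+\frac{1}{\exp(2\tau)}\left[\psi'\left(y_i+\frac{1}{\exp(\tau)}\right)-\psi'\left(\frac{1}{\exp(\tau)}\right)\right]\right\}$$
where for all appropriate link functions other than the negative binomial link function,
$$a_i=\frac{-y_i\exp(\tau)\mu_i+\mu_i+2\exp(\tau)\mu_i^2}{\left(1+\exp(\tau)\mu_i\right)^2}$$
and for the negative binomial link function,
$$a_i=0$$

Tweedie:
$$\sum_{i=1}^{n}f_i\frac{\partial^2\ell_i}{\partial\tau^2}$$
where
$$\frac{\partial^2\ell_i}{\partial\tau^2}=\begin{cases}-\dfrac{w_i\mu_i^{2-q}}{\exp(\tau)(2-q)} & \text{for } y_i=0\\[2ex]\dfrac{1}{V_i}\dfrac{\partial^2 V_i}{\partial\tau^2}-\dfrac{1}{V_i^2}\left(\dfrac{\partial V_i}{\partial\tau}\right)^2+\dfrac{w_i y_i\mu_i^{1-q}}{\exp(\tau)(1-q)}-\dfrac{w_i\mu_i^{2-q}}{\exp(\tau)(2-q)} & \text{for } y_i>0\end{cases}$$

Note: $\psi'(z)$ is the trigamma function, which is the derivative of $\psi(z)$, evaluated at z.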


$\dfrac{\partial V_i}{\partial\tau}=-\dfrac{1}{q-1}\sum_{j=1}^{\infty}jV_{ij}$, and the evaluation of it is similar to that of the series $V_i=\sum_{j=1}^{\infty}V_{ij}$.


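For the ordinal multinomial model:

$$H=\left[\frac{\partial^2\ell}{\partial B\,\partial B^T}\right]_{(J-1+p)\times(J-1+p)}=\begin{bmatrix}\dfrac{\partial^2\ell}{\partial\psi\,\partial\psi^T} & \dfrac{\partial^2\ell}{\partial\psi\,\partial\beta^T}\\[2ex]\dfrac{\partial^2\ell}{\partial\beta\,\partial\psi^T} & \dfrac{\partial^2\ell}{\partial\beta\,\partial\beta^T}\end{bmatrix}$$

The elements of H have two forms: (1) the expected first derivatives of the estimating equation s, which are used for Fisher scoring, and (2) the first derivatives of the estimating equation s, which are used for Newton-Raphson.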


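Expected second derivatives have the following expressions:

$$E\left[\frac{\partial^2\ell}{\partial\psi_{j-1}\partial\psi_j}\right]=\sum_{i=1}^{n}\frac{f_i w_i}{\phi}\frac{\partial\gamma_{i,j-1}}{\partial\eta_{i,j-1}}\frac{\partial\gamma_{i,j}}{\partial\eta_{i,j}}\frac{1}{\pi_{i,j}}$$

$$E\left[\frac{\partial^2\ell}{\partial\psi_j^2}\right]=-\sum_{i=1}^{n}\frac{f_i w_i}{\phi}\left(\frac{\partial\gamma_{i,j}}{\partial\eta_{i,j}}\right)^2\left(\frac{1}{\pi_{i,j}}+\frac{1}{\pi_{i,j+1}}\right)$$

$$E\left[\frac{\partial^2\ell}{\partial\psi_j\partial\psi_{j'}}\right]=0,\quad\text{for } |j-j'|>1,$$

$$E\left[\frac{\partial^2\ell}{\partial\psi_j\partial\beta}\right]=\sum_{i=1}^{n}\frac{f_i w_i}{\phi}\left[\left(\frac{\partial\gamma_{i,j}}{\partial\eta_{i,j}}-\frac{\partial\gamma_{i,j-1}}{\partial\eta_{i,j-1}}\right)\frac{1}{\pi_{i,j}}-\left(\frac{\partial\gamma_{i,j+1}}{\partial\eta_{i,j+1}}-\frac{\partial\gamma_{i,j}}{\partial\eta_{i,j}}\right)\frac{1}{\pi_{i,j+1}}\right]\frac{\partial\gamma_{i,j}}{\partial\eta_{i,j}}x_i,\quad j=1,\ldots,J-1,$$

$$E\left[\frac{\partial^2\ell}{\partial\beta\,\partial\beta^T}\right]=-\sum_{i=1}^{n}\sum_{j=1}^{J}\frac{f_i w_i}{\phi}\left(\frac{\partial\gamma_{i,j}}{\partial\eta_{i,j}}-\frac{\partial\gamma_{i,j-1}}{\partial\eta_{i,j-1}}\right)^2\frac{1}{\pi_{i,j}}x_i x_i^T$$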


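Second derivatives have the following expressions:

$$\frac{\partial^2\ell}{\partial\psi_{j-1}\partial\psi_j}=\sum_{i=1}^{n}\frac{f_i w_i}{\phi}\frac{\partial\gamma_{i,j-1}}{\partial\eta_{i,j-1}}\frac{\partial\gamma_{i,j}}{\partial\eta_{i,j}}\frac{y_{i,j}}{\pi_{i,j}^2}$$

$$\frac{\partial^2\ell}{\partial\psi_j^2}=\sum_{i=1}^{n}\frac{f_i w_i}{\phi}\left[\frac{\partial^2\gamma_{i,j}}{\partial\eta_{i,j}^2}\left(\frac{y_{i,j}}{\pi_{i,j}}-\frac{y_{i,j+1}}{\pi_{i,j+1}}\right)-\left(\frac{\partial\gamma_{i,j}}{\partial\eta_{i,j}}\right)^2\left(\frac{y_{i,j}}{\pi_{i,j}^2}+\frac{y_{i,j+1}}{\pi_{i,j+1}^2}\right)\right]$$

$$\frac{\partial^2\ell}{\partial\psi_j\partial\psi_{j'}}=0,\quad\text{for } |j-j'|>1,$$

$$\frac{\partial^2\ell}{\partial\psi_j\partial\beta}=-\sum_{i=1}^{n}\frac{f_i w_i}{\phi}\left[\frac{\partial^2\gamma_{i,j}}{\partial\eta_{i,j}^2}-\frac{\partial\gamma_{i,j}}{\partial\eta_{i,j}}\left(\frac{\partial\gamma_{i,j}}{\partial\eta_{i,j}}-\frac{\partial\gamma_{i,j-1}}{\partial\eta_{i,j-1}}\right)\frac{1}{\pi_{i,j}}\right]\frac{y_{i,j}}{\pi_{i,j}}x_i+\sum_{i=1}^{n}\frac{f_i w_i}{\phi}\left[\frac{\partial^2\gamma_{i,j}}{\partial\eta_{i,j}^2}-\frac{\partial\gamma_{i,j}}{\partial\eta_{i,j}}\left(\frac{\partial\gamma_{i,j+1}}{\partial\eta_{i,j+1}}-\frac{\partial\gamma_{i,j}}{\partial\eta_{i,j}}\right)\frac{1}{\pi_{i,j+1}}\right]\frac{y_{i,j+1}}{\pi_{i,j+1}}x_i$$

$$\frac{\partial^2\ell}{\partial\beta\,\partial\beta^T}=\sum_{i=1}^{n}\sum_{j=1}^{J}\frac{f_i w_i}{\phi}\left[\left(\frac{\partial^2\gamma_{i,j}}{\partial\eta_{i,j}^2}-\frac{\partial^2\gamma_{i,j-1}}{\partial\eta_{i,j-1}^2}\right)-\left(\frac{\partial\gamma_{i,j}}{\partial\eta_{i,j}}-\frac{\partial\gamma_{i,j-1}}{\partial\eta_{i,j-1}}\right)^2\frac{1}{\pi_{i,j}}\right]\frac{y_{i,j}}{\pi_{i,j}}x_i x_i^T$$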


Iterations 


An iterative process to find the solution for $\beta$ (which might include $\phi$, k for negative binomial, or $\psi$ for multinomial) is based on Newton-Raphson (for all iterations), Fisher scoring (for all iterations) or a hybrid method. The hybrid method consists of applying Fisher scoring steps for a specified number of iterations before switching to Newton-Raphson steps. Newton-Raphson performs well if the initial values are close to the solution, but the hybrid method can be used to make the algorithm more robust to bad initial values. Apart from improved robustness, Fisher scoring is faster due to the simpler form of the Hessian matrix.
The following notation applies to the iterative process:

Table 46-14
Notation

Notation    Description

I    Starting iteration for checking complete separation and quasi-complete separation. It must be 0 or a positive integer. This criterion is not used if the value is 0.

J    The maximum number of steps in the step-halving method. It must be a positive integer.

K    The first number of iterations using Fisher scoring, then switching to Newton-Raphson. It must be 0 or a positive integer. A value of 0 means using Newton-Raphson for all iterations and a value greater than or equal to M means using Fisher scoring for all iterations.

M    The maximum number of iterations. It must be a non-negative integer. If the value is 0, then the initial parameter values become the final estimates.

$\epsilon_l,\epsilon_p,\epsilon_H$    Tolerance levels for three types of convergence criteria.

Abs    A 0/1 binary variable; Abs = 1 if absolute change is used for convergence criteria and Abs = 0 if relative change is used.


The iterative process is outlined as follows:

1. Input values for I, J, K, M, $\epsilon_l$, $\epsilon_p$, $\epsilon_H$ and Abs for each of the three types of convergence criteria.

2. Input initial values $\beta^{(0)}$ or, if no initial values are given, compute them as described under “Initial Values” below. Then compute the log-likelihood $\ell^{(0)}$, gradient vector $s^{(0)}$ and Hessian matrix $H^{(0)}$ based on $\beta^{(0)}$.

3. Let $\xi=1$.

4. Compute estimates of the ith iteration:

$\beta^{(i)}=\beta^{(i-1)}-\xi\left(H^{(i-1)}\right)^{-}s^{(i-1)}$, where $H^{-}$ is a generalized inverse of H. Then compute the log-likelihood $\ell^{(i)}$ based on $\beta^{(i)}$.

5. Use the step-halving method if $\ell^{(i)}<\ell^{(i-1)}$: reduce $\xi$ by half and repeat step (4). The set of values of $\xi$ is $\{0.5^j: j=0,\ldots,J-1\}$. If J is reached but the log-likelihood is not improved, issue a warning message, then stop.

6. Compute the gradient vector $s^{(i)}$ and Hessian matrix $H^{(i)}$ based on $\beta^{(i)}$. Note that $W_e$ is used to calculate $H^{(i)}$ if $i\le K$; $W_o$ is used to calculate $H^{(i)}$ if $i>K$.

7. Check if complete or quasi-complete separation of the data is established (see below) if the distribution is binomial or ordinal multinomial and the current iteration i > I. If either complete or quasi-complete separation is detected, issue a warning message, then stop.

8. Check if all three convergence criteria (see below) are met. If they are not but M is reached, issue a warning message, then stop.

9. If all three convergence criteria are met, check if complete or quasi-complete separation of the data is established if the distribution is binomial or ordinal multinomial and i < I (because checking for complete or quasi-complete separation has not started yet). If complete or quasi-complete separation is detected, issue a warning message, then stop; otherwise, stop (the process converges successfully for binomial or ordinal multinomial). If all three convergence criteria are met for distributions other than binomial or ordinal multinomial, stop (the process converges successfully for other distributions). The final vector of estimates is denoted by $\hat\beta$ (and $\hat\tau$ and $\hat\psi$ for ordinal multinomial). Otherwise, go back to step (3).
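The loop in steps (3) through (9) can be illustrated with a deliberately small sketch. It assumes a Poisson model with log link and a single regression parameter (so the Hessian is a scalar and, because the link is canonical, Fisher scoring and Newton-Raphson coincide), uses only absolute-change criteria for the log-likelihood and the parameter, and omits the separation and Hessian checks. The function name and default tolerances are hypothetical.

```python
import math

def fit_poisson_scalar(x, y, beta0=0.0, M=100, J=5, eps_l=1e-10, eps_p=1e-6):
    """Newton-Raphson with step-halving for eta_i = beta * x_i, mu_i = exp(eta_i)."""
    def loglik(b):  # kernel of the Poisson log-likelihood
        return sum(yi * b * xi - math.exp(b * xi) for xi, yi in zip(x, y))

    def grad(b):    # s = sum x_i (y_i - mu_i)
        return sum(xi * (yi - math.exp(b * xi)) for xi, yi in zip(x, y))

    def hess(b):    # H = -sum x_i^2 mu_i
        return -sum(xi * xi * math.exp(b * xi) for xi in x)

    beta, ll = beta0, loglik(beta0)
    for i in range(1, M + 1):
        direction = -grad(beta) / hess(beta)      # -(H^-) s
        for j in range(J):                        # step sizes 0.5**j, j = 0..J-1
            new_beta = beta + (0.5 ** j) * direction
            new_ll = loglik(new_beta)
            if new_ll >= ll:                      # log-likelihood not decreased
                break
        else:
            raise RuntimeError("step-halving limit reached without improvement")
        converged = abs(new_ll - ll) < eps_l and abs(new_beta - beta) < eps_p
        beta, ll = new_beta, new_ll
        if converged:
            return beta, ll, i
    raise RuntimeError("maximum number of iterations reached")

beta_hat, ll_hat, n_iter = fit_poisson_scalar([1, 1, 1], [1, 2, 3])
```

With a constant predictor the maximum likelihood estimate is the log of the sample mean, so the example converges to ln 2.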


Initial Values 
If initial values are not specified by the user, they are calculated as follows: 


Set the initial fitted values $\tilde\mu_i=(y_i m_i+0.5)/(m_i+1)$ for a binomial distribution ($y_i$ can be a proportion or 0/1 value) and $\tilde\mu_i=y_i$ for a non-binomial distribution. From these derive $\tilde\eta_i=g(\tilde\mu_i)$, $g'(\tilde\mu_i)$ and $V(\tilde\mu_i)$. If $\tilde\eta_i$ becomes undefined, set $\tilde\eta_i=1$.

Calculate the weight matrix $\tilde W_e$ with the diagonal element $\tilde w_{ii}=\dfrac{f_i w_i}{\phi V(\tilde\mu_i)\left(g'(\tilde\mu_i)\right)^2}$, where $\phi$ is set to 1 or a fixed positive value. If the denominator of $\tilde w_{ii}$ becomes 0, set $\tilde w_{ii}=0$.

Assign the adjusted dependent variable z with the ith observation $z_i=(\tilde\eta_i-o_i)+(y_i-\tilde\mu_i)g'(\tilde\mu_i)$ for a binomial distribution and $z_i=(\tilde\eta_i-o_i)$ for a non-binomial distribution.

Calculate the initial parameter values

$$\beta^{(0)}=\left(X^T\tilde W_e X\right)^{-}X^T\tilde W_e z$$

and

$$\phi^{(0)}=\frac{1}{N}\left(z-X\beta^{(0)}\right)^T\tilde W_e\left(z-X\beta^{(0)}\right)$$

if the scale parameter is estimated by the ML method.


For the ancillary parameter k of the negative binomial model, the initial k = 1, so the initial $\tau=0$.

For the ordinal multinomial model, let $N_j=\sum_{i=1}^{n}f_i y_{i,j}$ be the number of responses in category j, and $N=\sum_{i=1}^{n}f_i$ be the effective sample size. Initial values for the threshold parameters, with and without the offset variable, are then computed according to the following formulae:


J N;, 
9) _ ja eul 
"I = ( N 


and 
n


jon J 
for j=1,....J-1, where 0; = ye LS fae 3 o/>- Tales. 


l=1 1=1 l=1 i=1 


Initial values for all regression parameters are set to zero. 


Scale Parameter Handling 


1. For normal, inverse Gaussian, gamma and Tweedie responses, if the scale parameter is estimated by the ML method, then it will be estimated jointly with the regression parameters; that is, the last element of the gradient vector s is with respect to $\tau$.


2. If the scale parameter is set to a fixed positive value, then it will be held fixed at that value in each iteration of the above process.


3. If the scale parameter is specified by the deviance or Pearson chi-square divided by degrees of freedom, then it will be fixed at 1 to obtain the regression estimates through the whole iterative process. Based on the regression estimates, calculate the deviance and Pearson chi-square values and obtain the scale parameter estimate.


Checking for Separation 


For each iteration after the user-specified number of iterations, that is, if i > I, calculate (note that here v refers to cases in the dataset)

$$p_{min}=\min_v p_v,$$

$$p_{max}=\max_v p_v,$$

$$p^*_{min}=\min_v\left(\min(\mu_v,1-\mu_v)\right),$$

where

$$p_v=\begin{cases}\mu_v & \text{if } y_v=\text{success }(=1)\\ 1-\mu_v & \text{if } y_v=\text{failure }(=0)\end{cases}$$

($p_v$ is the probability of the observed response for case v) and $\mu_v=g^{-1}\left(x_v^T\beta+o_v\right)$.
For the ordinal multinomial model, the definitions are modified as follows: 


Pmin = WUD Ty 7, 
F j 


Pmax = sear ia TU Yu 


Pmin = min (sun n»4 
~ 7 a 
The rules for checking complete separation or quasi-complete separation for binomial or multinomial models are otherwise the same.


If $\min(p_{min},p_{max})=p_{min}>0.99$ we consider there to be complete separation. Otherwise, if $p_{max}>0.99$ or $p^*_{min}<0.001$ and if there are very small diagonal elements (absolute value $<\sqrt{10^{-7}}\approx 3.16\times10^{-4}$) in the non-redundant parameter locations in the lower triangular matrix in the Cholesky decomposition of −H, where H is the Hessian matrix, then there is a quasi-complete separation.
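For the binomial case, the probability-based part of these rules can be sketched as follows. The Cholesky check on −H is not implemented here; the function only reports whether that second check would be needed. The function name and return labels are hypothetical.

```python
def separation_flags(mu, y, hi=0.99, lo=0.001):
    """Classify fitted binomial probabilities using p_min, p_max and p*_min."""
    # p_v: fitted probability of the response actually observed for case v
    p = [m if yi == 1 else 1.0 - m for m, yi in zip(mu, y)]
    p_min, p_max = min(p), max(p)
    p_star_min = min(min(m, 1.0 - m) for m in mu)
    if p_min > hi:
        return "complete"
    if p_max > hi or p_star_min < lo:
        # quasi-complete separation additionally requires very small diagonal
        # elements in the Cholesky factor of -H (not checked in this sketch)
        return "check Cholesky diagonal"
    return "none"

flag_complete = separation_flags([0.999, 0.001, 0.995], [1, 0, 1])
flag_candidate = separation_flags([0.999, 0.5], [1, 0])
flag_none = separation_flags([0.6, 0.4, 0.5], [1, 0, 1])
```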


Convergence Criteria 


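The following convergence criteria are considered:

Log-likelihood convergence:

$$\frac{\left|\ell^{(i)}-\ell^{(i-1)}\right|}{\left|\ell^{(i-1)}\right|+10^{-8}}<\epsilon_l\quad\text{if relative change}$$
$$\left|\ell^{(i)}-\ell^{(i-1)}\right|<\epsilon_l\quad\text{if absolute change}$$

Parameter convergence:

$$\max_j\left(\frac{\left|\beta_j^{(i)}-\beta_j^{(i-1)}\right|}{\left|\beta_j^{(i-1)}\right|+10^{-8}}\right)<\epsilon_p\quad\text{if relative change}$$
$$\max_j\left(\left|\beta_j^{(i)}-\beta_j^{(i-1)}\right|\right)<\epsilon_p\quad\text{if absolute change}$$

Hessian convergence:

$$\frac{\left(s^{(i)}\right)^T\left(H^{(i)}\right)^{-}\left(s^{(i)}\right)}{\left|\ell^{(i)}\right|+10^{-8}}<\epsilon_H\quad\text{if relative change}$$
$$\left(s^{(i)}\right)^T\left(H^{(i)}\right)^{-}\left(s^{(i)}\right)<\epsilon_H\quad\text{if absolute change}$$

where $\epsilon_l$, $\epsilon_p$ and $\epsilon_H$ are the given tolerance levels for each type.


If the Hessian convergence criterion is not user-specified, it is checked based on absolute change with $\epsilon_H$ = 1E-4 after the log-likelihood or parameter convergence criterion has been satisfied. If Hessian convergence is not met, a warning is displayed.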


Parameter Estimate Covariance Matrix, Correlation Matrix and Standard Errors 


The parameter estimate covariance matrix, correlation matrix and standard errors can be obtained easily with parameter estimates. Whether or not the scale parameter is estimated by ML, parameter estimate covariance and correlation matrices are listed for $\hat\beta$ only because the covariance between $\hat\beta$ and $\hat\tau$ should be zero.


If the ancillary parameter k ($\tau$) of the negative binomial is estimated by the ML method, the parameter estimate covariance and correlation matrices are still listed for $\hat\beta$ only, even though the covariance between $\hat\beta$ and $\hat\tau$ is generally not zero.


For the ordinal multinomial model, parameter estimate covariance and correlation matrices are listed for $\hat\beta$ and $\hat\psi$.


Model-Based Parameter Estimate Covariance 
The model-based parameter estimate covariance matrix is given by

$$\Sigma_m=-H^{-}=-\left(-X^T W X\right)^{-}$$

where $H^{-}$ is the generalized inverse of the Hessian matrix evaluated at the parameter estimates. The corresponding rows and columns for redundant parameter estimates should be set to zero.


Robust Parameter Estimate Covariance 


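The validity of the parameter estimate covariance matrix based on the Hessian depends on the correct specification of the variance function of the response in addition to the correct specification of the mean regression function of the response. The robust parameter estimate covariance provides a consistent estimate even when the specification of the variance function of the response is incorrect. The robust estimator is also called Huber’s estimator because Huber (1967) was the first to describe this variance estimate; White’s estimator or HCCM (heteroskedasticity consistent covariance matrix) estimator because White (1980) independently showed that this variance estimate is consistent under a linear regression model including heteroskedasticity; or the sandwich estimator because it includes three terms. The robust (or Huber/White/sandwich) estimator is defined as follows:

$$\Sigma_r=\Sigma_m\left[\sum_{i=1}^{n}f_i\frac{\partial\ell_i}{\partial\beta}\left(\frac{\partial\ell_i}{\partial\beta}\right)^T\right]\Sigma_m=\Sigma_m\left[\sum_{i=1}^{n}f_i\left(\frac{w_i(y_i-\hat\mu_i)}{\phi V(\hat\mu_i)g'(\hat\mu_i)}\right)^2x_i x_i^T\right]\Sigma_m$$

For the ordinal multinomial model,

$$\Sigma_r=\Sigma_m\left[\sum_{i=1}^{n}f_i\frac{\partial\ell_i}{\partial B}\left(\frac{\partial\ell_i}{\partial B}\right)^T\right]\Sigma_m$$

where

$$\frac{\partial\ell_i}{\partial B}=\left[\left(\frac{\partial\ell_i}{\partial\psi}\right)^T,\left(\frac{\partial\ell_i}{\partial\beta}\right)^T\right]^T=\left[\frac{\partial\ell_i}{\partial\psi_1},\ldots,\frac{\partial\ell_i}{\partial\psi_{J-1}},\frac{\partial\ell_i}{\partial\beta_1},\ldots,\frac{\partial\ell_i}{\partial\beta_p}\right]^T$$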
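As a rough illustration of how the model-based and robust variances relate, the sketch below evaluates both for a Poisson model with log link and a single regression parameter, so every matrix reduces to a scalar. For this model the per-case score is x_i (y_i − μ_i). Scale weights and the scale parameter are assumed to be 1, and the function name is hypothetical.

```python
import math

def poisson_scalar_variances(x, y, beta, f=None):
    """Return (model_based, robust) variance of beta for eta_i = beta * x_i."""
    f = f or [1.0] * len(y)
    mu = [math.exp(beta * xi) for xi in x]
    # model-based: inverse of X'WX, with W_ii = f_i * mu_i for the log link
    sigma_m = 1.0 / sum(fi * xi * xi * mi for fi, xi, mi in zip(f, x, mu))
    # middle term of the sandwich: sum of f_i times the squared per-case score
    meat = sum(fi * (xi * (yi - mi)) ** 2 for fi, xi, yi, mi in zip(f, x, y, mu))
    sigma_r = sigma_m * meat * sigma_m
    return sigma_m, sigma_r

sigma_m, sigma_r = poisson_scalar_variances([1, 1, 1], [1, 2, 3], math.log(2))
```

Here the fitted mean is 2 for every case, giving a model-based variance of 1/6 and a robust variance of 2/36; the robust value is smaller because the observed spread of y around 2 is less than the Poisson variance implies.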


Parameter Estimate Correlation 


The correlation matrix is calculated from the covariance matrix as usual. Let $\sigma_{ij}$ be an element of $\Sigma_m$ or $\Sigma_r$; then the corresponding element of the correlation matrix is $\dfrac{\sigma_{ij}}{\sqrt{\sigma_{ii}}\sqrt{\sigma_{jj}}}$. The corresponding rows and columns for redundant parameter estimates should be set to system missing values.


Parameter Estimate Standard Error 
Let $\hat\beta_i$ denote a non-redundant parameter estimate for all distributions except multinomial. Its standard error is the square root of the ith diagonal element of $\Sigma_m$ or $\Sigma_r$:

$$\hat\sigma_{\beta_i}=\sqrt{\sigma_{ii}}$$

The standard error for redundant parameter estimates is set to a system missing value. If the scale parameter is estimated by the ML method, we obtain $\hat\tau$ and its standard error estimate $\hat\sigma_\tau=\sqrt{-1\Big/\dfrac{\partial^2\ell}{\partial\tau^2}}$, where $\dfrac{\partial^2\ell}{\partial\tau^2}$ can be found in Table 46-13 “The second derivative functions w.r.t. the scale parameter for probability distributions”. Then the estimate of the scale parameter is $\exp(\hat\tau)$ and the standard error estimate is $\exp(\hat\tau)\cdot\hat\sigma_\tau$.

For the ordinal multinomial model, let $\hat\psi_j$, j = 1, ..., J − 1, be threshold parameter estimates and $\hat\beta_i$, i = 1, ..., p, denote non-redundant regression parameter estimates. Their standard errors are the square roots of the corresponding diagonal elements of $\Sigma_m$ or $\Sigma_r$: $\hat\sigma_{\psi_j}=\sqrt{\sigma_{jj}}$ and $\hat\sigma_{\beta_i}=\sqrt{\sigma_{(J-1+i),(J-1+i)}}$, respectively.


Confidence Intervals 


There are two methods of computing confidence intervals for the non-redundant parameters. 
One is based on the asymptotic normality of the parameter estimators, and the other is based on 
the profile likelihood function. The latter is time consuming because it needs to run iterative 
processes many times. 


Wald Confidence Intervals 


Wald confidence intervals are based on the asymptotic normal distribution of the parameter estimates. The 100(1 − α)% Wald confidence interval for $\beta_j$ is given by

$$\left(\hat\beta_j-z_{1-\alpha/2}\hat\sigma_{\beta_j},\ \hat\beta_j+z_{1-\alpha/2}\hat\sigma_{\beta_j}\right),$$

where $z_p$ is the 100pth percentile of the standard normal distribution.

If exponentiated parameter estimates are requested for logistic regression or log-linear models, then using the delta method, the estimate of $\exp(\beta_j)$ is $\exp(\hat\beta_j)$, the standard error estimate of $\exp(\hat\beta_j)$ is $\exp(\hat\beta_j)\cdot\hat\sigma_{\beta_j}$ and the corresponding 100(1 − α)% Wald confidence interval for $\exp(\beta_j)$ is

$$\left(\exp\left(\hat\beta_j-z_{1-\alpha/2}\hat\sigma_{\beta_j}\right),\ \exp\left(\hat\beta_j+z_{1-\alpha/2}\hat\sigma_{\beta_j}\right)\right).$$

Wald confidence intervals for redundant parameter estimates are set to system missing values.

Similarly, the 100(1 − α)% Wald confidence interval for $\phi$ or k of the negative binomial model is

$$\left(\exp\left(\hat\tau-z_{1-\alpha/2}\hat\sigma_\tau\right),\ \exp\left(\hat\tau+z_{1-\alpha/2}\hat\sigma_\tau\right)\right).$$

Additionally, for the ordinal multinomial model, the 100(1 − α)% Wald confidence interval for $\psi_j$ is given by

$$\left(\hat\psi_j-z_{1-\alpha/2}\hat\sigma_{\psi_j},\ \hat\psi_j+z_{1-\alpha/2}\hat\sigma_{\psi_j}\right),$$

the estimate of $\exp(\psi_j)$ is $\exp(\hat\psi_j)$, the standard error estimate of $\exp(\hat\psi_j)$ is $\exp(\hat\psi_j)\cdot\hat\sigma_{\psi_j}$ and the corresponding 100(1 − α)% Wald confidence interval for $\exp(\psi_j)$ is

$$\left(\exp\left(\hat\psi_j-z_{1-\alpha/2}\hat\sigma_{\psi_j}\right),\ \exp\left(\hat\psi_j+z_{1-\alpha/2}\hat\sigma_{\psi_j}\right)\right).$$
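A short sketch of the Wald interval and its exponentiated (delta-method) counterpart follows. It uses the standard library's normal quantile function; the function name and return layout are hypothetical.

```python
import math
from statistics import NormalDist

def wald_ci(estimate, std_error, alpha=0.05, exponentiate=False):
    """Return (estimate, std_error, (lower, upper)), optionally on the exp scale."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # z_{1 - alpha/2}
    lower, upper = estimate - z * std_error, estimate + z * std_error
    if exponentiate:
        # delta method: se(exp(b)) is approximately exp(b) * se(b);
        # the interval is the exponentiated Wald interval
        return (math.exp(estimate), math.exp(estimate) * std_error,
                (math.exp(lower), math.exp(upper)))
    return estimate, std_error, (lower, upper)

est, se, (lo, hi) = wald_ci(1.0, 0.5)
exp_est, exp_se, (exp_lo, exp_hi) = wald_ci(1.0, 0.5, exponentiate=True)
```

The same function covers the scale or ancillary parameter when called with the estimate of τ and its standard error and `exponentiate=True`.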


Profile Likelihood Confidence Intervals 


The construction of the profile likelihood confidence interval (PLCI) was first derived from the asymptotic chi-square distribution of the generalized likelihood ratio test by Venzon and Moolgavkar (1988). We use the modified algorithm, which is equivalent to theirs, by Heinze and Ploner (2002). The computation is iterative and very time consuming, especially if the number of predictors is large, because the number of iterative processes needed is $2p_x$; for the ordinal multinomial model, it is $2(J-1+p_x)$. PLCIs for redundant parameter estimates are set to system missing values and do not involve iterative processes.


The iterative process is as follows:

1. Let the initial values $\beta^{(0)}$ (which might include $\tau$, or $\psi$ for multinomial) be the maximum likelihood estimates; the initial log-likelihood $\ell^{(0)}$, gradient vector $s^{(0)}$ and Hessian matrix $H^{(0)}$ are obtained based on $\beta^{(0)}$.

2. Calculate $\ell_0=\ell^{(0)}-0.5\chi^2_{1-\alpha,1}$, where $\chi^2_{1-\alpha,1}$ is the 100(1 − α)th percentile of the chi-square distribution with one degree of freedom.

3. Set the parameter number j = 1.

4. Set the iteration number i = 1.

5. Compute the incremental value $\lambda$ at the (i − 1)th iteration:

$$\lambda^{(i-1)}=\pm\left\{\frac{2\left(\ell_0-\ell^{(i-1)}+\frac{1}{2}\left(s^{(i-1)}\right)^T\left(H^{(i-1)}\right)^{-}s^{(i-1)}\right)}{e_j^T\left(H^{(i-1)}\right)^{-}e_j}\right\}^{1/2}$$

where $e_j$ is the jth unit vector. Take the positive values of $\lambda$ first.

In rare cases, the value in the above braces is negative or $\ell^{(i-1)}$ is missing or undefined. In that case, $\lambda^{(i-1)}$ is undefined (note that $\lambda^{(0)}$ is highly unlikely to be undefined) and the parameters can’t be updated. To solve this problem, in general, we just take a simple average of the parameters from the two previous iterations, $\beta^{(i)}=\frac{1}{2}\left(\beta^{(i-1)}+\beta^{(i-2)}\right)$. If $\lambda^{(i)}$ based on $\beta^{(i)}$ is still undefined, we continue the process up to 5 times by taking the average of the current $\beta^{(i)}$ value and $\beta^{(i-2)}$ until $\lambda^{(i)}$ becomes defined; otherwise, we issue a warning and stop.

6. Compute the step size $d^{(i-1)}=-\left(H^{(i-1)}\right)^{-}\left(s^{(i-1)}+\lambda^{(i-1)}e_j\right)$.

7. Update parameter estimates $\beta^{(i)}=\beta^{(i-1)}+d^{(i-1)}$.

8. Compute the log-likelihood $\ell^{(i)}$, gradient vector $s^{(i)}$ and Hessian matrix $H^{(i)}$ based on $\beta^{(i)}$. Note that whether $W_e$ or $W_o$ is used to calculate $H^{(i)}$ should be based on what was used in the maximum likelihood estimation of $\beta$.

9. Check if the following two criteria with tolerance levels $\epsilon_l$ and $\epsilon_H$ are satisfied:

(a) $\left|\ell^{(i)}-\ell_0\right|\le\epsilon_l$

(b) $\left(s^{(i)}+\lambda^{(i)}e_j\right)^T\left(H^{(i)}\right)^{-}\left(s^{(i)}+\lambda^{(i)}e_j\right)\le\epsilon_H$

If both criteria are met or the maximum number of iterations is reached, stop. Otherwise, set i = i + 1 and go back to step (5).

10. The final vector of estimates is denoted by $\beta^U$; then $\beta_j^U$ is the upper confidence limit for $\beta_j$.

11. Repeat steps (4) through (9) with negative values of $\lambda$ in step (5) to find the lower confidence limit $\beta_j^L$.

12. Repeat steps (4) through (11) by setting the parameter number j = 2, ..., $p_x$.


Note 


■ If the scale parameter or the ancillary parameter k of the negative binomial model is estimated by the ML method, then it will be estimated jointly with the regression parameters for the iterative processes of each regression parameter $\beta_j$, j = 1, ..., $p_x$. Then the PLCI for $\phi$ will be obtained by the iterative processes as well, and is equal to $\left(\exp(\tau^L),\exp(\tau^U)\right)$. Similarly, the profile likelihood confidence interval for $\exp(\beta_j)$ is calculated as $\left(\exp(\beta_j^L),\exp(\beta_j^U)\right)$.

■ If the scale parameter or the ancillary parameter k of the negative binomial model is set to a fixed positive value, then it will be held fixed at that value for each iterative process.

■ If the scale parameter is specified for all distributions by the deviance or Pearson chi-square divided by degrees of freedom, then $\phi$ will be held fixed at the value estimated from the deviance or Pearson statistic during the full model fit for each iterative process. For more information, see the topic “Goodness-of-Fit Statistics”.


Chi-Square Statistics 


The hypothesis $H_{0j}:\beta_j=0$ is tested for each non-redundant parameter using the chi-square statistic:

$$c_j=\left(\frac{\hat\beta_j}{\hat\sigma_{\beta_j}}\right)^2$$

which has an asymptotic chi-square distribution with 1 degree of freedom.

Chi-square statistics and their corresponding p-values are set to system missing values for redundant parameter estimates.

The chi-square statistic is not calculated for the scale parameter, even if it is estimated by the ML method.

For the ordinal multinomial model, the hypotheses $H_{0j}:\psi_j=0$, j = 1, ..., J − 1, and $H_{0i}:\beta_i=0$, i = 1, ..., p, are tested for threshold parameters and non-redundant regression parameters using the chi-square statistics

$$c_{\psi_j}=\left(\frac{\hat\psi_j}{\hat\sigma_{\psi_j}}\right)^2$$

and

$$c_{\beta_i}=\left(\frac{\hat\beta_i}{\hat\sigma_{\beta_i}}\right)^2$$

P Values


Given a test statistic T and a corresponding cumulative distribution function G as specified above, the p-value is defined as p = 1 − G(T). For example, the p-value for the chi-square test of $H_{0j}:\beta_j=0$ is $p=1-\mathrm{prob}\left(\chi^2_1\le c_j\right)$.
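The Wald chi-square statistic and its p-value can be computed without a statistics library, because the upper tail of a chi-square distribution with 1 degree of freedom equals erfc(sqrt(c/2)). The function name below is hypothetical.

```python
import math

def wald_chi_square(estimate, std_error):
    """Return (chi-square statistic, p-value) for H0: parameter = 0."""
    c = (estimate / std_error) ** 2
    # For 1 df: P(chi2 > c) = P(|Z| > sqrt(c)) = erfc(sqrt(c / 2))
    p_value = math.erfc(math.sqrt(c / 2.0))
    return c, p_value

c_stat, p_val = wald_chi_square(1.0, 0.5)
```

An estimate twice its standard error gives a statistic of 4 and a p-value of about 0.0455, matching the familiar two-sided normal tail for z = 2.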


Model Testing 


After estimating parameters and calculating relevant statistics, several tests for the given model 
are performed. 


Lagrange Multiplier Test 


If the scale parameter for normal, inverse Gaussian, gamma, and Tweedie distributions is set to a fixed value or specified by the deviance or Pearson chi-square divided by the degrees of freedom (in which case it can be considered as a fixed value), or the ancillary parameter k for the negative binomial is set to a fixed value other than 0, the Lagrange Multiplier (LM) test assesses the validity of the value. For a fixed $\phi$ or k, the test statistic is defined as

$$T_{LM}=\frac{s^2}{A}$$

where $s=\partial\ell/\partial\tau$ and $A=\left(-\dfrac{\partial^2\ell}{\partial\tau^2}\right)-\left(-\dfrac{\partial^2\ell}{\partial\tau\,\partial\beta^T}\right)\left(-\dfrac{\partial^2\ell}{\partial\beta\,\partial\beta^T}\right)^{-}\left(-\dfrac{\partial^2\ell}{\partial\beta\,\partial\tau}\right)$, evaluated at the parameter estimates and the fixed $\phi$ or k value. $T_{LM}$ has an asymptotic chi-square distribution with 1 degree of freedom, and the p-values are calculated accordingly.

For testing $\phi$, see Table 46-12 “The 1st derivative functions w.r.t. the scale parameter for probability distributions” and Table 46-13 “The second derivative functions w.r.t. the scale parameter for probability distributions” for the elements of s and A, respectively.


If k is set to 0, then the above statistic can’t be applied. According to Cameron and Trivedi (1998), the LM test statistic should now be based on the following auxiliary OLS regression (without constant)

$$\frac{(y_i-\hat\mu_i)^2-y_i}{\hat\mu_i}=\alpha\hat\mu_i+\varepsilon_i$$

where $\hat\mu_i=g^{-1}\left(x_i^T\hat\beta+o_i\right)$ and $\varepsilon_i$ is an error term. Let the response of the above OLS regression, $\left[(y_i-\hat\mu_i)^2-y_i\right]/\hat\mu_i$, be $z_i$ and the explanatory variable $\hat\mu_i$ be $w_i$. The estimate of the above regression parameter α and the standard error of the estimate of α are

$$\hat\alpha=\frac{\sum_{i=1}^{n}f_i w_i z_i}{\sum_{i=1}^{n}f_i w_i^2}\quad\text{and}\quad\hat\sigma_\alpha=\sqrt{\frac{s^2}{\sum_{i=1}^{n}f_i w_i^2}},$$

where $s^2=\dfrac{1}{N-1}\sum_{i=1}^{n}f_i e_i^2$ and $e_i=z_i-\hat\alpha w_i$. Then the LM test statistic is a z statistic

$$z=\frac{\hat\alpha}{\hat\sigma_\alpha}$$

and it has an asymptotic standard normal distribution under the null hypothesis of equidispersion in a Poisson model ($H_0:k=0$). Three p-values are provided. The alternative hypothesis can be one-sided overdispersion ($H_a:k>0$), underdispersion ($H_a:k<0$) or two-sided non-directional ($H_a:k\ne0$) with the variance function $V(\mu)=\mu+k\mu^2$. The calculation of p-values depends on the alternative. For $H_a:k>0$, p-value = $1-\Phi(z)$, where $\Phi(\cdot)$ is the cumulative probability of a standard normal distribution; for $H_a:k<0$, p-value = $\Phi(z)$; and for $H_a:k\ne0$, p-value = $2\left(1-\Phi(|z|)\right)$.
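A minimal sketch of this auxiliary-regression test is given below. Two details are assumptions made here because the extracted formulas are damaged: the response of the auxiliary regression subtracts y_i (the form given by Cameron and Trivedi), and the residual variance divides by N − 1, with N the sum of the frequency weights. The function name is hypothetical.

```python
import math
from statistics import NormalDist

def lm_test_poisson(y, mu_hat, f=None):
    """LM test of k = 0 (Poisson) against negative binomial dispersion."""
    f = f or [1.0] * len(y)
    # response and explanatory variable of the auxiliary OLS regression
    z = [((yi - mi) ** 2 - yi) / mi for yi, mi in zip(y, mu_hat)]
    w = list(mu_hat)
    sww = sum(fi * wi * wi for fi, wi in zip(f, w))
    alpha = sum(fi * wi * zi for fi, wi, zi in zip(f, w, z)) / sww
    resid = [zi - alpha * wi for zi, wi in zip(z, w)]
    s2 = sum(fi * ei * ei for fi, ei in zip(f, resid)) / (sum(f) - 1.0)  # assumed df
    stat = alpha / math.sqrt(s2 / sww)
    cdf = NormalDist().cdf
    p_values = {
        "overdispersion": 1.0 - cdf(stat),
        "underdispersion": cdf(stat),
        "two_sided": 2.0 * (1.0 - cdf(abs(stat))),
    }
    return stat, p_values

z_stat, p_vals = lm_test_poisson([0, 4, 2, 2], [2.0, 2.0, 2.0, 2.0])
```

In the example the auxiliary responses are 2, 0, −1 and −1, which sum to zero, so the slope estimate and the z statistic are both zero and the data show no evidence of over- or underdispersion.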


Goodness-of-Fit Statistics 
Several statistics are calculated to assess goodness of fit of a given generalized linear model. 
Deviance 


The theoretical definition of deviance is:

$$D=2\phi\left(\ell(y;y)-\ell(\hat\mu;y)\right),$$

where $\ell(\hat\mu;y)$ is the log-likelihood function expressed as a function of the predicted mean values $\hat\mu$ (calculated based on the parameter estimates) given the response variable, and $\ell(y;y)$ is the log-likelihood function computed by replacing $\hat\mu$ with y. The formula used for the deviance is $D=\sum_{i=1}^{n}f_i d_i$, where the form of $d_i$ for each distribution is given in the following table:

Table 46-15
Deviance for individual case

Distribution: $d_i$

Normal: $w_i(y_i-\hat\mu_i)^2$

Inverse Gaussian: $\dfrac{w_i(y_i-\hat\mu_i)^2}{\hat\mu_i^2 y_i}$

Gamma: $2w_i\left[-\ln\left(\dfrac{y_i}{\hat\mu_i}\right)+\dfrac{y_i-\hat\mu_i}{\hat\mu_i}\right]$

Negative Binomial: $2w_i\left[y_i\ln\left(\dfrac{y_i}{\hat\mu_i}\right)-\left(y_i+\dfrac{1}{k}\right)\ln\left(\dfrac{y_i+1/k}{\hat\mu_i+1/k}\right)\right]$

Poisson: $2w_i\left[y_i\ln\left(\dfrac{y_i}{\hat\mu_i}\right)-(y_i-\hat\mu_i)\right]$

Binomial(m): $2w_i m_i\left[y_i\ln\left(\dfrac{y_i}{\hat\mu_i}\right)+(1-y_i)\ln\left(\dfrac{1-y_i}{1-\hat\mu_i}\right)\right]$

Tweedie: $2w_i\dfrac{y_i^{2-q}-(2-q)y_i\hat\mu_i^{1-q}+(1-q)\hat\mu_i^{2-q}}{(1-q)(2-q)}$

Note

■ When y is a binary dependent variable with 0/1 values (binomial distribution), the deviance and Pearson chi-square are calculated based on the subpopulations; see below.

■ When y = 0 for negative binomial and Poisson distributions and y = 0 (for r = 0) or 1 (for r = m) for the binomial distribution with r/m format, separate values are given for the deviance. Let $d_i$ be the deviance value for individual case i when $y_i$ = 0 for negative binomial and Poisson and 0/1 for binomial.

Table 46-16
Deviance for individual case

Distribution: $d_i$

Negative Binomial: $\dfrac{2w_i\ln(1+k\hat\mu_i)}{k}$ if $y_i=0$

Poisson: $2w_i\hat\mu_i$ if $y_i=0$

Binomial(m): $-2w_i m_i\ln(1-\hat\mu_i)$ if $y_i=0$ or $r_i=0$; $-2w_i m_i\ln(\hat\mu_i)$ if $y_i=1$ or $r_i=m_i$


Pearson Chi-Square 


$$\chi^2=\sum_{i=1}^{n}f_i\gamma_i$$

where $\gamma_i=\dfrac{w_i m_i(y_i-\hat\mu_i)^2}{V(\hat\mu_i)}$ for the binomial distribution and $\gamma_i=\dfrac{w_i(y_i-\hat\mu_i)^2}{V(\hat\mu_i)}$ for other distributions.
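To make the case-wise formulas concrete, the sketch below computes the deviance and Pearson chi-square for the Poisson distribution, using the standard Poisson unit deviance and its special value for y = 0. Frequency and scale weights default to 1, and the function name is hypothetical.

```python
import math

def poisson_deviance_pearson(y, mu, f=None, w=None):
    """Return (deviance, Pearson chi-square) for Poisson fitted means mu."""
    n = len(y)
    f = f or [1.0] * n
    w = w or [1.0] * n
    deviance = pearson = 0.0
    for yi, mi, fi, wi in zip(y, mu, f, w):
        if yi == 0:
            d = 2.0 * wi * mi                      # limit of y*ln(y/mu) as y -> 0 is 0
        else:
            d = 2.0 * wi * (yi * math.log(yi / mi) - (yi - mi))
        deviance += fi * d
        pearson += fi * wi * (yi - mi) ** 2 / mi   # V(mu) = mu for Poisson
    return deviance, pearson

dev, chi2 = poisson_deviance_pearson([0, 2], [1.0, 2.0])
```

Dividing either statistic by its degrees of freedom gives the corresponding scale parameter estimate used elsewhere in this chapter.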


Scaled Deviance and Scaled Pearson Chi-Square 
The scaled deviance is $D^*=D/\phi$ and the scaled Pearson chi-square is $\chi^{2*}=\chi^2/\phi$.

Since the scaled deviance and Pearson chi-square statistics have a limiting chi-square distribution with $N-p_x$ degrees of freedom, the deviance or Pearson chi-square divided by its degrees of freedom can be used as an estimate of the scale parameter for both continuous and discrete distributions.

$$\hat\phi=\frac{D}{N-p_x}\quad\text{or}\quad\hat\phi=\frac{\chi^2}{N-p_x}$$

If the ancillary parameter k of the negative binomial model is estimated by the ML method and the scale parameter is measured by the deviance or Pearson chi-square divided by its degrees of freedom, then the degrees of freedom are $N-p_x-1$ because k is an extra parameter estimated by the ML method.

If the scale parameter is measured by the deviance or Pearson chi-square, first we assume $\phi=1$, then estimate the regression parameters, calculate the deviance and Pearson chi-square values and obtain the scale parameter estimate from the above formula. Then the scaled version of both statistics is obtained by dividing the deviance and Pearson chi-square by $\hat\phi$. In the meantime, some statistics need to be revised. The gradient vector and the Hessian matrix are divided by $\hat\phi$ and the covariance matrix is multiplied by $\hat\phi$. Accordingly, the estimated standard errors are also adjusted, and the Wald confidence intervals and significance tests are affected even though the parameter estimates are not affected by $\hat\phi$.

Note that two log-likelihood values are displayed: the original one (based on $\phi=1$) and the revised one (based on $\phi=\hat\phi$, which is plugged into the log-likelihood function of the corresponding distribution). Prior to version 16, only the original one was displayed. The original log-likelihood is used in computing the information criteria but the revised log-likelihood is used in the model fitting omnibus test.


Overdispersion 


For the Poisson and binomial distributions, if the estimated scale parameter is not near the 
assumed value of one, then the data may be overdispersed if the value is greater than one or 
underdispersed if the value is less than one. Overdispersion is more common in practice. The 
problem with overdispersion is that it may cause standard errors of the estimated parameters to be 
underestimated. A variable may appear to be a significant predictor, when in fact it is not. 


Deviance and Pearson Chi-Square for Binomial with 0/1 Binary Response and Ordinal 
Multinomial 


When r and m (event/trial) variables are used for the binomial distribution, each case represents m Bernoulli trials. When y is a binary dependent variable with 0/1 values, each case represents a single trial. The trial can be repeated several times with the same setting (i.e. the same values for all predictor variables). For example, suppose the first 10 y values are two 1s and eight 0s and the x values are the same (if recorded in events/trials format, these 10 cases are recorded as 1 case with r = 2 and m = 10); then these 10 cases should be considered to be from the same subpopulation. Cases with common values in the variable list that includes all predictor variables are regarded as coming from the same subpopulation. When the binomial distribution with binary response is used, we should calculate the deviance and Pearson chi-square based on the subpopulations. If we calculate them based on the cases, the results might not be useful.


If subpopulations are specified for the binomial distribution with a 0/1 binary response variable, the data should be reconstructed from the single trial format to the events/trials format. Assume the following notation for formatted data:

Table 46-17
Notation

Notation    Description

$n_s$    Number of subpopulations.

$r_{j1}$    Sum of the product of the frequencies and the scale weights associated with y = 1 in the jth subpopulation. So $r_{j0}$ is that with y = 0 in the jth subpopulation.

$m_j$    Total weighted observations; $m_j=r_{j1}+r_{j0}$.

$y_{j1}$    The proportion of 1s in the jth subpopulation; $y_{j1}=r_{j1}/m_j$.

$\hat\mu_j$    The fitted probability in the jth subpopulation ($\hat\mu_j$ would be the same for each case in the jth subpopulation because values for all predictor variables are the same for each case).


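The deviance and Pearson chi-square are defined as follows:

$$D=2\sum_{j=1}^{n_s}m_j\left\{y_{j1}\ln\left(\frac{y_{j1}}{\hat\mu_j}\right)+(1-y_{j1})\ln\left(\frac{1-y_{j1}}{1-\hat\mu_j}\right)\right\}\quad\text{and}\quad\chi^2=\sum_{j=1}^{n_s}\frac{m_j(y_{j1}-\hat\mu_j)^2}{\hat\mu_j(1-\hat\mu_j)},$$

then the corresponding estimate of the scale parameter will be

$$\hat\phi=\frac{D}{n_s-p_x}\quad\text{and}\quad\hat\phi=\frac{\chi^2}{n_s-p_x}.$$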
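The regrouping into subpopulations and the two statistics can be sketched as follows. Cases are grouped by their tuple of predictor values; the fitted probability for each subpopulation is passed in as a plain dictionary, which is a simplification made for this illustration. All names are hypothetical.

```python
import math
from collections import defaultdict

def binomial_subpopulation_stats(cases, mu_by_x):
    """cases: iterable of (x_tuple, y, f, w) with y in {0, 1}.
    Returns (number of subpopulations, deviance, Pearson chi-square)."""
    totals = defaultdict(lambda: [0.0, 0.0])       # x -> [r_j1, r_j0]
    for x, y, f, w in cases:
        totals[x][0 if y == 1 else 1] += f * w
    deviance = pearson = 0.0
    for x, (r1, r0) in totals.items():
        m = r1 + r0
        y1 = r1 / m
        mu = mu_by_x[x]
        # terms with a zero proportion contribute nothing (0 * ln 0 taken as 0)
        if y1 > 0:
            deviance += 2.0 * m * y1 * math.log(y1 / mu)
        if y1 < 1:
            deviance += 2.0 * m * (1.0 - y1) * math.log((1.0 - y1) / (1.0 - mu))
        pearson += m * (y1 - mu) ** 2 / (mu * (1.0 - mu))
    return len(totals), deviance, pearson

# ten single-trial cases with identical predictors: two 1s and eight 0s
cases = [((1.0,), 1, 1.0, 1.0)] * 2 + [((1.0,), 0, 1.0, 1.0)] * 8
n_sub, dev_sub, chi2_sub = binomial_subpopulation_stats(cases, {(1.0,): 0.25})
```

The ten cases collapse into one subpopulation with r = 2 and m = 10, exactly as in the events/trials example described in the text.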


The full log likelihood, based on subpopulations, is defined as follows:

$$\ell=\ell_k+\sum_{j=1}^{n_s}\ln\left(\frac{m_j!}{r_{j1}!\,r_{j0}!}\right),$$

where $\ell_k$ is the kernel log likelihood; it is the same as the kernel log-likelihood computed based on cases before, so there is no need to compute it again.


For the ordinal multinomial model, similarly, the data will be reconstructed based on 
subpopulations. Assume the following notation for reconstructed ordinal multinomial data: 


Table 46-18 
Notation 
Notation             Description 

$n_s$                Number of subpopulations. 

$r_{i,j}$            Sum of the product of the frequencies and the scale weights associated with the jth 
                     category in the ith subpopulation. 

$m_i$                Total weighted observations for the ith subpopulation; $m_i = \sum_{j=1}^{J} r_{i,j}$. 

$\hat{\pi}_{i,j}$    The fitted probability for the jth category in the ith subpopulation. 


The deviance and Pearson chi-square are defined as follows. 

$$D = 2\sum_{i=1}^{n_s}\sum_{j=1}^{J} r_{i,j}\ln\left(\frac{r_{i,j}}{m_i\hat{\pi}_{i,j}}\right)$$

and 

$$\chi^2 = \sum_{i=1}^{n_s}\sum_{j=1}^{J} \frac{(r_{i,j} - m_i\hat{\pi}_{i,j})^2}{m_i\hat{\pi}_{i,j}}$$

with $n_s(J-1) - d$ degrees of freedom, where $d = J - 1 + p_x$. The corresponding estimates of 
the scale parameter will be 

$$\hat{\phi}_D = \frac{D}{n_s(J-1) - d}$$

and 

$$\hat{\phi}_P = \frac{\chi^2}{n_s(J-1) - d}$$

The full log likelihood, based on subpopulations, is defined as follows: 

$$\ell = \ell_k + c = \ell_k + \sum_{i=1}^{n_s} \ln\left(\frac{m_i!}{\prod_{j=1}^{J} r_{i,j}!}\right)$$

where again $\ell_k$ is the same as before. 

Information Criteria 


Information criteria are used when comparing different models for the same data. The formulas 
for various criteria are as follows. 


Akaike information criteria (AIC). $-2\ell + 2d$ 

Finite sample corrected (AICC). $-2\ell + \frac{2dN}{N-d-1}$ 

Bayesian information criteria (BIC). $-2\ell + d\ln(N)$ 

Consistent AIC (CAIC). $-2\ell + d(\ln(N) + 1)$ 


where $\ell$ is the log-likelihood evaluated at the parameter estimates. Notice that $d = p_x$ if only $\beta$ is 
included; $d = p_x + 1$ if the scale parameter $\phi$ is included for normal, inverse Gaussian, gamma, and 
Tweedie, or k for the negative binomial distribution; for multinomial, $d = J - 1 + p_x$. 
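As an illustration, the four criteria can be computed directly from the log-likelihood, the number of parameters d, and N. The function name `information_criteria` is hypothetical, and the sketch assumes N > d + 1 so that the AICC denominator is positive.

```python
import math

def information_criteria(loglik, d, n):
    """Return AIC, AICC, BIC and CAIC for log-likelihood `loglik`,
    `d` estimated parameters and `n` (the N used in the formulas)."""
    m2l = -2.0 * loglik
    return {
        "AIC": m2l + 2.0 * d,
        "AICC": m2l + 2.0 * d * n / (n - d - 1),
        "BIC": m2l + d * math.log(n),
        "CAIC": m2l + d * (math.log(n) + 1.0),
    }
```

Note that $2dN/(N-d-1)$ equals $2d + 2d(d+1)/(N-d-1)$, so AICC is AIC plus a correction that vanishes as N grows.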


Notes 


■ $\ell$ (the full log-likelihood) can be replaced with $\ell_k$ (the kernel of the log-likelihood) depending 
on the user’s choice. 

■ If the scale parameter is specified by the deviance or Pearson chi-square, then the log 
likelihood is based on $\phi = 1$, for fair comparison among different models. 

■ When r and m (event/trial) variables are used for the binomial distribution, then the N used 
here would be the sum of the trials frequencies; $N = \sum_{i=1}^{n} f_i m_i$. In this way, the same value 
results whether the data are in raw, binary form (using single-trial syntax) or in summarized, 
binomial form (events/trials syntax). 


Test of Model Fit 


The model fitting omnibus test is based on −2 log-likelihood values for the model under 
consideration and the initial model. For the model under consideration, the value of the −2 
log-likelihood is 

$$-2\ell(\hat{\beta})$$

Let the initial model be the intercept-only model if intercept is in the considered model or the 
empty model otherwise. For the intercept-only model, the value of the −2 log-likelihood is 

$$-2\ell(\hat{\beta}_0)$$

For the empty model, the value of the −2 log-likelihood is 

$$-2\ell(0)$$

Then the omnibus (or global) test statistic is 

$$S = 2\left(\ell(\hat{\beta}) - \ell(\hat{\beta}_0)\right)$$ for the intercept-only model or 

$$S = 2\left(\ell(\hat{\beta}) - \ell(0)\right)$$ for the empty model. 

S has an asymptotic chi-square distribution with r degrees of freedom, equal to the difference in 
the number of valid parameters between the model under consideration and the initial model: 
$r = p_x - 1$ for the intercept-only model; $r = p_x$ for the empty model. The p-values then can 
be calculated accordingly. 
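A minimal sketch of the omnibus test follows. The names `chi2_sf` and `omnibus_test` are hypothetical. The upper-tail chi-square probability is computed with the standard recurrence Q(x; ν+2) = Q(x; ν) + (x/2)^(ν/2) e^(−x/2) / Γ(ν/2 + 1), which is a choice made for this example so that only the standard library is needed; it is not claimed to be the method the procedure uses.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        q, nu = math.exp(-half), 2                 # closed form for 2 df
    else:
        q, nu = math.erfc(math.sqrt(half)), 1      # closed form for 1 df
    while nu < df:                                 # step up two df at a time
        q += half ** (nu / 2.0) * math.exp(-half) / math.gamma(nu / 2.0 + 1.0)
        nu += 2
    return q

def omnibus_test(loglik_model, loglik_initial, r):
    """S = 2 * (l(model) - l(initial)) with r degrees of freedom."""
    s = 2.0 * (loglik_model - loglik_initial)
    return s, r, chi2_sf(s, r)
```

Here `r` would be $p_x - 1$ when the initial model is intercept-only and $p_x$ when it is the empty model.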


Note if the scale parameter or the ancillary parameter is estimated by the ML method in the model 
under consideration, then it will also be estimated by the ML method in the initial model. 


For the ordinal multinomial model, the value of −2 log-likelihood for the model under 
consideration is 

$$-2\ell(\hat{B})$$

the value of −2 log-likelihood for the thresholds-only model is 

$$-2\ell(B^{(0)})$$

where $B^{(0)} = \left(\psi^{(0)T}, \beta^{(0)T}\right)^T$ is the vector of initial parameter values used in the iterative process. Then the 
omnibus test statistic is 

$$S = 2\left(\ell(\hat{B}) - \ell(B^{(0)})\right)$$

and it has an asymptotic chi-square distribution with $p_x$ degrees of freedom. 


When calculating the value of —2 log-likelihood of initial model, the following rules are used to 
handle the scale parameter or the ancillary parameter k in the initial model. 


If the scale parameter or the ancillary parameter is estimated by the ML method in the model 
under consideration, then it will also be estimated by the ML method in the initial model. 


If the scale parameter or the ancillary parameter is held fixed in the model under consideration, 
then the same value is fixed in the initial model. 


If the scale parameter is specified by the deviance or Pearson chi-square divided by degrees 
of freedom in the model under consideration, then that value will be held fixed in the initial 
model. Note that the log likelihood for the model under consideration would be revised; that 
is, based on $\phi = \hat{\phi}$, so the log likelihoods for both models (the model under consideration and 
initial model) are calculated based on the same scale parameter value. This is to be consistent 
with the way chi-square statistics in type I and III analyses are computed. Prior to version 16, 
the log likelihoods for both models were calculated based on $\phi = 1$; thus the omnibus test statistic 
will be different between version 15 and later versions. 


Default Tests of Model Effects 


For each regression effect specified in the model, type I and III analyses can be conducted. 


Type I Analysis 


Type I analysis consists of fitting a sequence of models, starting with the null model as the 
baseline model (for all distributions except ordinal multinomial), adding one additional effect, 
which can be an intercept term (if there is one), covariates, factors and interactions, of the model 
on each step. For the ordinal multinomial model, the baseline model is a thresholds-only model. 
Thus, the test depends on the order of effects specified in the model. On the other hand, type III 
analysis won’t depend on the order of effects. The reason for using the null model as the baseline 
model is to obtain the chi-square statistic for the first parameter which might be for an intercept or 
the first predictor variable. 


There are two kinds of test statistics for type I analysis: likelihood ratio statistics and Wald 
statistics. 


Likelihood ratio statistics. Different formulae are used to calculate likelihood ratio statistics 
depending on how the scale parameter or ancillary parameter is handled. 


■ Estimated by ML method. The likelihood ratio statistics are twice the difference of the log 
likelihoods between two successive models. Unlike type III analysis, we don’t obtain the log 
likelihood of the constrained model based on the type I test matrix. 

Start by considering the first pair of models $\eta = o$ (the null model with the log likelihood 
$\ell(0)$, where $\phi$ or k might be estimated) and $\eta = x_1^T\beta_1 + o$ (with the log likelihood $\ell(\hat{\beta}_1)$, where $\beta_1$ and $\phi$ 
or k are estimated jointly). The test statistic for the null hypothesis $H_0: \beta_1 = 0$ is 

$$S = 2\left(\ell(\hat{\beta}_1) - \ell(0)\right)$$

The log-likelihood convergence criterion is used when estimating the above two models. The 
tolerance level is the same as that used for the parameter estimation iterative process. A 
similar rule applies to usage of relative or absolute change. 

Note the optimal estimated scale parameter would be different for the above two models. If 
either log-likelihood is not available due to numerical problems in parameter estimation, then 
the test statistic, degrees of freedom and p-value are all set to system missing values. Similar 
rules will apply to other pairs of models below. 

Then consider the second pair of models $\eta = x_1^T\beta_1 + o$ and $\eta = x_1^T\beta_1 + x_2^T\beta_2 + o$. The test 
statistic for the null hypothesis $H_0: \beta_2 = 0$ based on $\beta_1$ is 

$$S = 2\left(\ell(\hat{\beta}_1, \hat{\beta}_2) - \ell(\hat{\beta}_1)\right)$$

Then consider the third pair of models $\eta = x_1^T\beta_1 + x_2^T\beta_2 + o$ and 
$\eta = x_1^T\beta_1 + x_2^T\beta_2 + x_3^T\beta_3 + o$. The likelihood ratio statistic for the null hypothesis 
$H_0: \beta_3 = 0$ based on $\beta_1$ and $\beta_2$ is formed in the same way: 

$$S = 2\left(\ell(\hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3) - \ell(\hat{\beta}_1, \hat{\beta}_2)\right)$$

Continue this way until all effects in the model are included. A similar convergence criterion 
applies to all reduced models except the full model. Each likelihood ratio statistic S has an 
asymptotic chi-square distribution with degrees of freedom equal to the difference in the 
number of parameters estimated in the successive models. The p-values can be calculated 
accordingly. 

■ Set to a fixed positive value. The likelihood ratio statistics are calculated as above 
except $\phi$ or k is held fixed at that value. 

For the ordinal multinomial model, the scale parameter can be set to a fixed value or be 
specified by the deviance or Pearson chi-square divided by degrees of freedom. We briefly 
describe here how the statistics can be constructed when it is a fixed value. 

First, consider the first pair of models $\eta = \psi + o$ and $\eta = \psi - x_1^T\beta_1 + o$. The likelihood ratio 
statistic for the null hypothesis $H_0: \beta_1 = 0$ based on $\psi$ is 

$$S = 2\left(\ell(\hat{\psi}, \hat{\beta}_1) - \ell(\hat{\psi})\right)$$

Then consider the second pair of models $\eta = \psi - x_1^T\beta_1 + o$ and $\eta = \psi - x_1^T\beta_1 - x_2^T\beta_2 + o$. The likelihood 
ratio statistic for the null hypothesis $H_0: \beta_2 = 0$ based on $\psi$ and $\beta_1$ is 

$$S = 2\left(\ell(\hat{\psi}, \hat{\beta}_1, \hat{\beta}_2) - \ell(\hat{\psi}, \hat{\beta}_1)\right)$$

Again, S has an asymptotic chi-square distribution with degrees of freedom equal to the 
difference in the number of parameters in the successive models. 


■ Specified from the full model by the deviance or Pearson chi-square. In this case, the 
likelihood ratio chi-square and F statistics can be computed to assess the significance of each 
additional effect. 

Suppose that $\ell_{f,1}$ is the log-likelihood from fitting a generalized model (model f) and that 
$\ell_{s,1}$ is the log-likelihood from fitting a sub-model (model s). Both models are fit assuming 
the scale parameter equals 1. Then the test statistic is defined by 

$$S = \frac{2\left(\ell_{f,1} - \ell_{s,1}\right)}{\hat{\phi}}$$

It has an asymptotic chi-square distribution with r degrees of freedom, where r is the difference 
in the number of parameters between the two models and $\hat{\phi}$ is estimated from the full model. 

In some references the test statistic is defined by 

$$S = \frac{D_s - D_f}{\hat{\phi}}$$

where $D_f$ is the deviance from fitting model f and $D_s$ is the deviance from fitting a sub-model 
s. However, this formulation can result in negative chi-square statistics for negative binomial 
responses where the ancillary parameter is estimated by maximum likelihood. 

Since $\phi$ is unknown and the estimator $\hat{\phi}$ is the deviance or Pearson chi-square statistic divided 
by its degrees of freedom, $(N - p_x)\hat{\phi}/\phi$ has an asymptotic chi-square distribution with 
$N - p_x$ degrees of freedom. Thus, the F statistic can be defined as 

$$F = \frac{S}{r}$$

Under the assumption that $2(\ell_{f,1} - \ell_{s,1})/\phi$ and $(N - p_x)\hat{\phi}/\phi$ are approximately independent, 
the F statistic has an asymptotic F distribution with r and $N - p_x$ degrees of freedom, 
and the p-values can be calculated accordingly. Note for the negative binomial with the 
ancillary parameter k estimated by the ML method and with the scale parameter measured 
by the deviance or Pearson chi-square divided by its degrees of freedom, the degrees of 
freedom in the denominator for the F statistic are $N - p_x - 1$; for the binomial distribution 
with 0/1 binary response, the degrees of freedom for the denominator should be $n_s - p_x$; 
for the ordinal multinomial model, the degrees of freedom for the denominator should be 
$n_s(J-1) - (J-1+p_x)$. 

For type I analysis, model f is the higher order model obtained by including one 
additional effect in model s. For example, for the second pair of models, model f is 
$\eta = x_1^T\beta_1 + x_2^T\beta_2 + o$ and model s is $\eta = x_1^T\beta_1 + o$. 
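The sequential construction of type I likelihood ratio statistics can be sketched as below. The function name `type1_lr_statistics` and its arguments are hypothetical. It assumes the log-likelihoods of the nested sequence of models (null model first, full model last) are already available; when the scale parameter comes from the full model, the log-likelihoods are assumed to have been computed with the scale set to 1 and `phi_hat` is the full-model estimate. With `phi_hat=1.0` the function returns the plain statistic $2(\ell_f - \ell_s)$.

```python
def type1_lr_statistics(logliks, n_params, phi_hat=1.0):
    """Likelihood ratio chi-square and F statistics for successive models.

    logliks[k]  : log-likelihood of the k-th model in the sequence
    n_params[k] : number of parameters estimated in the k-th model
    phi_hat     : scale estimate from the full model (1.0 if not applicable)
    Returns a list of (S, r, F) tuples, one per added effect.
    """
    rows = []
    for k in range(1, len(logliks)):
        r = n_params[k] - n_params[k - 1]             # degrees of freedom
        s = 2.0 * (logliks[k] - logliks[k - 1]) / phi_hat
        rows.append((s, r, s / r))                    # F = S / r
    return rows
```

The denominator degrees of freedom for F are not computed here because, as noted above, they depend on the distribution and on how the scale and ancillary parameters are handled.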


Wald Statistics. For each effect specified in the model, a type I test matrix $L_i$ is constructed 
and $H_0: L_i\beta = 0$ is tested. Construction of matrix $L_i$ is based on the generating matrix 
$H_\omega = (X^T\Omega X)^{-}X^T\Omega X$, where $\Omega$ is the scale weight matrix with ith diagonal element $\omega_i$, and 
such that $L_i\beta$ is estimable. It involves parameters only for the given effect and the effects 
containing the given effect. If such a matrix cannot be constructed, the effect is not testable. 


Since Wald statistics can be applied to type I and III analysis and custom tests, we express Wald 
statistics in a more general form. The Wald statistic for testing $L_i\beta = K$, where $L_i$ is an r×p full 
row rank hypothesis matrix and K is an r×1 resulting vector, is defined by 

$$S = (L_i\hat{\beta} - K)^T\left(L_i\Sigma L_i^T\right)^{-}(L_i\hat{\beta} - K)$$

where $\hat{\beta}$ is the maximum likelihood estimate and $\Sigma$ is the parameter estimates covariance matrix. S 
has an asymptotic chi-square distribution with $r_c$ degrees of freedom, where $r_c = \mathrm{rank}(L\Sigma L^T)$. 

If $r_c < r$, then $(L\Sigma L^T)^{-}$ is a generalized inverse such that Wald tests are effective for a restricted 
set of hypotheses $L_c\beta = K_c$ containing a particular subset C of independent rows from $H_0$. 

For type I and III analysis, calculate the Wald statistic for each effect i according to the 
corresponding hypothesis matrix $L_i$ and K = 0. 
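A small pure-Python sketch of the Wald statistic is shown below. All names (`solve`, `wald_statistic`) are hypothetical. It assumes the full-rank case $r_c = r$, so an ordinary linear solve replaces the generalized inverse; the rank-deficient case is not handled.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [list(row) + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (aug[r][n] - sum(aug[r][k] * x[k] for k in range(r + 1, n))) / aug[r][r]
    return x

def wald_statistic(L, beta_hat, sigma, K):
    """S = (L b - K)^T (L Sigma L^T)^{-1} (L b - K), assuming full rank."""
    p = len(beta_hat)
    d = [sum(row[a] * beta_hat[a] for a in range(p)) - k for row, k in zip(L, K)]
    # m = L Sigma L^T, built entry by entry
    m = [[sum(ri[a] * sigma[a][b] * rj[b] for a in range(p) for b in range(p))
          for rj in L] for ri in L]
    u = solve(m, d)
    return sum(di * ui for di, ui in zip(d, u))
```

For a single-row L that picks out one coefficient, the statistic reduces to the squared estimate divided by its variance.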


For the ordinal multinomial model, first consider partitions of the more general test matrix 
$L = (L(\psi), L(\beta))$, where $L(\psi) = (l_1, \ldots, l_{J-1})$ consists of columns corresponding to threshold 
parameters and $L(\beta)$ is the part of L corresponding to regression parameters. Consider 
matrix $L_0 = (l_0, L(\beta))$ where the column vectors corresponding to threshold parameters are 
replaced by their sum $l_0 = \sum_{j=1}^{J-1} l_j$. Then LB is estimable if and only if $L_0 = L_0H_\omega$, where 
$H_\omega = (X_\omega^T\Omega X_\omega)^{-}X_\omega^T\Omega X_\omega$ is a (1 + p) × (1 + p) matrix constructed using $X_\omega = (1, -X)$. The 
Wald statistic for testing LB = K, where L is an r × (J − 1 + p) full row rank hypothesis matrix 
and K is an r × 1 resulting vector, is defined by 

$$S = (L\hat{B} - K)^T\left(L\Sigma L^T\right)^{-}(L\hat{B} - K)$$

where $\hat{B} = (\hat{\psi}^T, \hat{\beta}^T)^T$ is the maximum likelihood estimate and $\Sigma$ is the estimated covariance matrix 
($\Sigma$ could be the model based or robust estimator). The asymptotic distribution of S is $\chi^2_{r_c}$, where 
$r_c = \mathrm{rank}(L\Sigma L^T)$. 

For each effect specified in the model excluding intercept, a type I test matrix $L_i$ is constructed and 
$H_0: L_iB = 0$ is tested. Construction of matrix $L_i$ is based on matrix $H_\omega = (X_\omega^T\Omega X_\omega)^{-}X_\omega^T\Omega X_\omega$ and 
such that $L_iB$ is estimable. Thus the way to construct $L_i$ (type I and III) for ordinal multinomial is 
the same as that for other distributions. 


Type III Analysis 


Similar to type I analysis, two kinds of test statistics are available for type III analysis: likelihood 
ratio statistics and Wald statistics. 


Likelihood ratio statistics. The likelihood ratio statistics can be obtained as follows: 


Calculate the log-likelihood evaluated at the constrained maximum likelihood estimate under the 
constraint $L_i\beta = 0$ for each effect: 

$$\ell(\tilde{\beta}_i) = \max_{\beta}\ \ell(\mu(\beta); y) \quad \text{s.t. } L_i\beta = 0,$$

where $L_i$ is the type III test matrix for the ith effect and $\tilde{\beta}_i$ denotes the constrained estimate. $\ell(\tilde{\beta}_i)$ will be obtained by sequential 
quadratic programming. For more information, see the topic “Sequential Quadratic 
Programming”. 

The calculations of $\ell(\hat{\beta})$ and $\ell(\tilde{\beta}_i)$ are based on how the scale parameter $\phi$ or the ancillary 
parameter k is handled: 

If $\phi$ or k is estimated jointly with $\beta$ by the ML method, then $\ell(\hat{\beta})$ is the log likelihood evaluated at 
$\hat{\beta}$ and $\hat{\phi}$ or $\hat{k}$, and $\ell(\tilde{\beta}_i)$ is the log likelihood evaluated at $\tilde{\beta}_i$ and $\tilde{\phi}_i$ or $\tilde{k}_i$ under the constraint 
$L_i\beta = 0$ for each effect i. Note that the constraint should be expanded by including $\phi$ or k so that 
the last element in the expanded $\beta$ is $\phi$ or k and the last element in the expanded $L_i$ is 0. 

If $\phi$ or k is set to a fixed value, then $\ell(\hat{\beta})$ and $\ell(\tilde{\beta}_i)$ are calculated with $\phi$ or k held fixed at that 
value for both unconstrained and constrained estimation processes. 

If $\phi$ is specified from the full model by the deviance or Pearson chi-square divided by degrees of 
freedom, then $\ell(\hat{\beta})$ and $\ell(\tilde{\beta}_i)$ are calculated with $\phi$ assumed to be 1. In addition, the deviance 
values for both unconstrained and constrained models are also calculated. 


Then calculate the likelihood ratio statistic for each effect i. 

If $\phi$ or k is estimated jointly with $\beta$ by the ML method or set to a fixed value, 

$$S_i = 2\left(\ell(\hat{\beta}) - \ell(\tilde{\beta}_i)\right)$$

Then $S_i$ has an asymptotic chi-square distribution with degrees of freedom r, where r is equal to 
the rank of the $L_i$ matrix. 

If $\phi$ is specified from the full model by the deviance or Pearson chi-square divided by degrees of 
freedom, 

$$S_i = \frac{2\left(\ell(\hat{\beta}) - \ell(\tilde{\beta}_i)\right)}{\hat{\phi}} \quad\text{and}\quad F_i = \frac{S_i}{r}$$

respectively. Then $S_i$ has an asymptotic chi-square distribution with degrees of freedom r. $F_i$ has 
an asymptotic F distribution with r and $N - p_x$ degrees of freedom. Note for the negative binomial 
with the ancillary parameter k estimated by the ML method and with the scale parameter measured 
by the deviance or Pearson chi-square divided by its degrees of freedom, the degrees of freedom 
in the denominator for the F statistic are $N - p_x - 1$; for the binomial distribution with 0/1 binary 
response, the degrees of freedom for the denominator should be $n_s - p_x$; for the ordinal multinomial 
model, the degrees of freedom for the denominator should be $n_s(J-1) - (J-1+p_x)$. 


Wald statistics. See the discussion of Wald statistics for Type I analysis above. $L_i$ is the type 
III test matrix for the ith effect. 


Sequential Quadratic Programming 


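Sequential quadratic programming is a method of linearly constrained optimization that can be 
applied to type III analysis and custom tests. It has the general form: 

$$\max_{\beta}\ \ell(\mu(\beta); y) \quad \text{s.t. } L\beta = K,$$

where L is an r×p full row rank hypothesis matrix and K is an r×1 resulting vector. Note for the 
ordinal multinomial model, L is an r × (J − 1 + p) full row rank hypothesis matrix. 
To simplify the notation, we write the log-likelihood as $\ell(\beta)$ here. The Lagrange 
function with an r×1 vector of Lagrange multipliers $\lambda$ is 

$$\mathcal{L}(\beta, \lambda) = \ell(\beta) + \lambda^T(L\beta - K) = \ell(\beta) + \sum_{i=1}^{r}\lambda_i\left(L_i^T\beta - K_i\right)$$

where $L_i^T$ denotes the ith row of L. The first order conditions with respect to $\beta$ and $\lambda$ are 

$$\frac{\partial\mathcal{L}(\beta, \lambda)}{\partial\beta} = \frac{\partial\ell(\beta)}{\partial\beta} + L^T\lambda = [0]_{p\times 1} \quad \text{(dual feasibility equations)}$$

$$\frac{\partial\mathcal{L}(\beta, \lambda)}{\partial\lambda} = L\beta - K = [0]_{r\times 1} \quad \text{(primal feasibility equations)}$$

We would like to find a solution ($\beta^*$ and $\lambda^*$) for the above KKT (Karush-Kuhn-Tucker) equations, 
which are a set of p + r equations. The method usually used is an extension of the Newton-Raphson 
method. First we replace the log-likelihood with its second-order Taylor approximation near $\beta$ to 
reformulate the problem as 

$$\max_{\delta}\ \ell(\beta) + s(\beta)^T\delta + \tfrac{1}{2}\delta^T H(\beta)\,\delta \quad \text{s.t. } L(\beta + \delta) = K$$

where $s(\beta)$ and $H(\beta)$ are the gradient and Hessian of the log-likelihood at $\beta$. 
This is a quadratic optimization problem with variable $\delta$. We use the feasible start method to solve 
the KKT equations. 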


Feasible Start Method 


The feasible $\beta$ values satisfy $L\beta = K$ and belong to the domain of the log-likelihood. If the 
initial values of $\beta$ are feasible, then $L\delta = 0$ and the constrained problem is almost the same as 
the Newton-Raphson method without constraints. The iterative process can be outlined briefly 
as follows: 

1. Find initial values $\beta^{(0)}$ with $L\beta^{(0)} = K$ (see below), then compute $\ell(\beta^{(0)})$, $s(\beta^{(0)})$ and $H(\beta^{(0)})$. 

2. Set $\xi = 1$. 

3. Find a solution $\delta^{(i)}$ and $\lambda^{(i)}$ for the following KKT equations: 

$$\begin{pmatrix} H(\beta^{(i-1)}) & L^T \\ L & 0 \end{pmatrix}\begin{pmatrix} \delta^{(i)} \\ \lambda^{(i)} \end{pmatrix} = \begin{pmatrix} -s(\beta^{(i-1)}) \\ 0 \end{pmatrix}_{(p+r)\times 1}$$

4. Compute estimates of the ith iteration: 
$\beta^{(i)} = \beta^{(i-1)} + \xi\,\delta^{(i)}$, then compute $\ell(\beta^{(i)})$. 

5. Use the step-halving method if $\ell(\beta^{(i)}) < \ell(\beta^{(i-1)})$: reduce $\xi$ by half and repeat step (4). If the 
maximum number of steps in step-halving is reached but the log-likelihood is not improved, stop. 

6. Check if convergence criteria (see below) are met. If they are or the maximum number of iterations 
is reached, stop. The final vector of estimates is denoted by $\tilde{\beta}$. Otherwise, go back to step (2). 
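The iteration above can be sketched in plain Python. Everything here is illustrative: `solve` and `sqp_equality` are hypothetical names, `hessian` is assumed to return the matrix of second derivatives of the log-likelihood, the starting value is assumed to be feasible, and only a log-likelihood convergence check on the absolute change is implemented.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [list(row) + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (aug[r][n] - sum(aug[r][k] * x[k] for k in range(r + 1, n))) / aug[r][r]
    return x

def sqp_equality(loglik, score, hessian, L, beta0,
                 max_iter=50, tol=1e-10, max_halving=10):
    """Maximize loglik(beta) subject to L beta = K, starting from a feasible beta0."""
    beta = list(beta0)
    p, r = len(beta), len(L)
    ll = loglik(beta)
    for _iteration in range(max_iter):
        s, h = score(beta), hessian(beta)
        # KKT system [[H, L^T], [L, 0]] [delta; lambda] = [-s; 0]
        kkt = [list(h[i]) + [L[j][i] for j in range(r)] for i in range(p)]
        kkt += [list(L[j]) + [0.0] * r for j in range(r)]
        delta = solve(kkt, [-si for si in s] + [0.0] * r)[:p]
        xi, improved = 1.0, False
        for _halving in range(max_halving + 1):
            cand = [b + xi * d for b, d in zip(beta, delta)]
            ll_new = loglik(cand)
            if ll_new >= ll:          # accept the step
                improved = True
                break
            xi /= 2.0                 # step-halving
        if not improved:
            break
        converged = abs(ll_new - ll) < tol
        beta, ll = cand, ll_new
        if converged:
            break
    return beta
```

Because $L\delta = 0$ at every step, each iterate stays feasible, which is why K does not appear in the loop once the starting value satisfies the constraint.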


Initial Values 


The initial values for constrained optimization under the constraint $L_i\beta = 0$ for each effect i in type 
III analysis can be obtained by applying the method used to obtain the initial values for unconstrained 
parameter estimation, with the added constraint that the type III contrast equals zero. Specifically, follow steps 
(1) to (3) of the method for computing initial values for parameter estimation (see the appropriate 
section under “Parameter estimation”), then solve the following KKT equations 


pxtw.x ] el 7 PxTWea) . 
Li 0 gy) 0 (ptr)x1 


The solution will be a feasible point. Then the initial value for @ or k can be obtained as before. 
For the ordinal multinomial model, initial values for unconstrained parameter estimation can be 
applied here because they are feasible values. 


Convergence Criteria 


We only consider the log-likelihood convergence criterion for the constrained optimization 
problem, to speed up the iterative processes here. If the tolerance level and the choice of relative or 
absolute change are user-specified for the unconstrained optimization problem, then they will also apply here; otherwise, the 
internal default values will be used. 


Estimated Marginal Means 


There are two types of estimated marginal means (EMMEANS) calculated here. One corresponds 
to the specified factors for the linear predictor of the model and the other corresponds to those 

for the response of the model. EMMEANS for the predictor are equivalent to LSMEANS 
(least-squares means) used by SAS. EMMEANS for the response are equivalent to conditional 
marginals used by SUDAAN or conditional prediction used by Lane and Nelder (1982). 


EMMEANS are based on the estimated cell means. For a given fixed set of factors, or their 
interactions, we estimate marginal means as the mean value averaged over all cells generated 
by the rest of the factors in the model. Covariates may be fixed at any specified value. If not 
specified, the value for each covariate is set to its overall mean estimate. 


For the ordinal multinomial model, EMMEANS are not available. 


EMMEANS for the Linear Predictor 


Calculating EMMEANS for the Linear Predictor 


EMMEANS for the linear predictor are based on the link function transformation. They are 
computed for the linear predictor. Since the given model with respect to the linear predictor is a 
linear model, the way to construct L is the same as that for the GLM procedure. Each EMMEAN 
for the linear predictor is constructed such that $L\beta$ is estimable. 


Briefly, for a given set of factors in the model, a vector of EMMEANS for the linear predictor is 
created for all combined levels of the factors. Assume there are r levels. This r×1 vector can be 
expressed in the form $\hat{v} = L\hat{\beta}$. The variance matrix of $\hat{v}$ is then computed by 

$$V(\hat{v}) = L\Sigma L^T$$

The standard error for the jth element of $\hat{v}$ is the square root of the jth diagonal element of $V(\hat{v})$. 
Let the jth element of $\hat{v}$ and its standard error be $\hat{v}_j$ and $\hat{\sigma}_{v_j}$, respectively; then the corresponding 
100(1 − α)% Wald confidence interval for $v_j$, j = 1, ..., r, is given by 

$$\left(\hat{v}_j - z_{1-\alpha/2}\,\hat{\sigma}_{v_j},\ \hat{v}_j + z_{1-\alpha/2}\,\hat{\sigma}_{v_j}\right)$$
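As an illustration of the calculation above, the sketch below computes each estimate, its standard error and the Wald interval row by row. The function name `emmeans_linear` is hypothetical, and the L matrix is assumed to have been constructed already so that every row is estimable.

```python
import math
from statistics import NormalDist

def emmeans_linear(L, beta_hat, sigma, alpha=0.05):
    """Return (estimate, std_error, lower, upper) for each row of L."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)    # z_{1 - alpha/2}
    p = len(beta_hat)
    out = []
    for row in L:
        est = sum(row[a] * beta_hat[a] for a in range(p))
        # jth diagonal element of L Sigma L^T
        var = sum(row[a] * sigma[a][b] * row[b] for a in range(p) for b in range(p))
        se = math.sqrt(var)
        out.append((est, se, est - z * se, est + z * se))
    return out
```

Only the diagonal of $L\Sigma L^T$ is needed for the intervals; the full matrix is required later when contrasts among the EMMEANS are tested.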
Comparing EMMEANS for the Linear Predictor 


We can compare EMMEANS for the linear predictor based on a selected contrast type, for which 
a set of contrasts for the factor is created. Let this set of contrasts define matrix C used for 
testing the hypothesis $H_0: Cv = 0$. A Wald statistic is used for testing the given set of contrasts for 
the factor as follows: 

$$S = (C\hat{v})^T\left(C\,V(\hat{v})\,C^T\right)^{-}(C\hat{v})$$

S has an asymptotic chi-square distribution with $r_c$ degrees of freedom, where 
$r_c = \mathrm{rank}\left(C\,V(\hat{v})\,C^T\right)$. The p-values can be calculated accordingly. Note that adjusted p-values 
based on multiple comparisons adjustments won’t be computed for the overall test. 


Each row $c_i^T$ of matrix C is also tested separately. The estimate for the ith row is given by 
$c_i^T\hat{v}$ and its standard error by $\sqrt{c_i^T V(\hat{v})\,c_i}$. The corresponding 100(1 − α)% Wald confidence 
interval for $c_i^Tv$ is given by 

$$c_i^T\hat{v} \pm z_{1-\alpha/2}\sqrt{c_i^T V(\hat{v})\,c_i}$$

The Wald statistic for $H_{0i}: c_i^Tv = 0$ is 

$$S_i = \frac{\left(c_i^T\hat{v}\right)^2}{c_i^T V(\hat{v})\,c_i}$$

It has an asymptotic chi-square distribution with 1 degree of freedom. The p-values can be 
calculated accordingly. In addition, adjusted p-values for multiple comparisons can also be computed. 


EMMEANS for the Response 


EMMEANS for the response are based on the original scale of the dependent variable except for 
the binomial response with events/trials format (see note below). They can be defined as the 
estimator of the expected response for a subject conditional on his/her belonging to a specified 
effect and having the averages of covariates. Note that as for the so called predicted marginals 
used by SUDAAN or marginal prediction used by Lane and Nelder (1982), we will not offer them 
because they require some assumptions about the distribution of the predictor variables. 


Calculating EMMEANS for the Response 


The way to construct EMMEANS for the response is based on EMMEANS for the linear 
predictor. Let $\hat{M}$ be EMMEANS for the response; it is defined as 

$$\hat{M} = g^{-1}(\hat{v})$$

The variance of EMMEANS for the response is 

$$V(\hat{M}) = \mathrm{diag}\left(\frac{\partial g^{-1}(\hat{v}_j)}{\partial v_j}\right) L\Sigma L^T\ \mathrm{diag}\left(\frac{\partial g^{-1}(\hat{v}_j)}{\partial v_j}\right)$$

where $\mathrm{diag}\left(\partial g^{-1}(\hat{v}_j)/\partial v_j\right)$ is an r×r matrix, $\partial g^{-1}(\hat{v}_j)/\partial v_j$ is the derivative of the inverse of 
the link with respect to the jth value in $\hat{v}$, and $\partial g^{-1}(\hat{v}_j)/\partial v_j = 1/g'(\hat{M}_j)$, where $g'(\hat{M}_j)$ is from 
Table 46-8. The standard error for the jth element of $\hat{M}$ and the corresponding confidence 
interval are calculated similarly to those of $\hat{v}$. For more information, see the topic “EMMEANS 
for the Linear Predictor”. 


Note: $\hat{M}$ is EMMEANS for the proportion, not for the number of events when events and trials 
variables are used for the binomial distribution. 
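To make the variance formula concrete, here is a minimal sketch for one specific link, the logit, chosen only as an example. For the logit link $g'(M) = 1/(M(1-M))$, so the derivative of the inverse link is $M(1-M)$. The function name `emmeans_response_logit` is hypothetical, and `cov_v` is assumed to be the already computed matrix $L\Sigma L^T$.

```python
import math

def emmeans_response_logit(v_hat, cov_v):
    """Back-transform EMMEANS from the logit scale and propagate the variance.

    v_hat : EMMEANS for the linear predictor
    cov_v : their variance matrix, L Sigma L^T
    Returns (m_hat, cov_m) with cov_m = D cov_v D and D = diag(m (1 - m)).
    """
    m_hat = [1.0 / (1.0 + math.exp(-v)) for v in v_hat]
    d = [m * (1.0 - m) for m in m_hat]             # 1 / g'(M_j) for the logit link
    r = len(d)
    cov_m = [[d[a] * cov_v[a][b] * d[b] for b in range(r)] for a in range(r)]
    return m_hat, cov_m
```

Other links would only change the two lines that compute `m_hat` and `d`.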


Comparing EMMEANS for the Response 


This is similar to comparing EMMEANS for the linear predictor; just replace $\hat{v}$ with $\hat{M}$ and 
$V(\hat{v})$ with $V(\hat{M})$. For more information, see the topic “EMMEANS for the Linear Predictor”. 


Multiple Comparisons 


The hypothesis $H_0: Cv = 0$ can be tested using the multiple row hypotheses testing technique. 
Let $c_i^T$ be the ith row vector of matrix C. The ith row hypothesis is $H_{0i}: c_i^Tv = 0$. Testing $H_0$ is the 
same as testing multiple non-redundant row hypotheses $\{H_{0i}\}_{i=1}^{R}$ simultaneously, where R is the 
number of non-redundant row hypotheses, and $H_{0i}$ represents the ith non-redundant hypothesis. A 
hypothesis $H_{0i}$ is redundant if there exists another hypothesis $H_{0j}$, j ≠ i, such that $c_i = a\,c_j$, a ≠ 0. 


Adjusted p-values. For each individual hypothesis $H_{0i}$, test statistics can be calculated. Let 
$p_i$ denote the p-value for testing $H_{0i}$ and $p_i^*$ denote the adjusted p-value. The conclusion from 
multiple testing is, at level α (the family-wise type I error), 

reject $H_{0i}: c_i^Tv = 0$ if $p_i^* < \alpha$; 

reject $H_0: Cv = 0$ if $\min_i(p_i^*) < \alpha$. 

Several different methods to adjust p-values are provided here. Please note that if the adjusted 
p-value is bigger than 1, it is set to 1 in all the methods. 


Adjusted confidence intervals. Note that if confidence intervals are also calculated for the 
above hypothesis, then adjusting confidence intervals is required to correspond to adjusted 
p-values. The only item that needs to be adjusted in the confidence intervals is the critical value 
from the standard normal distribution. Assume that the original critical value is $z_{1-\alpha/2}$ and the 
adjusted critical value is $z^*$. 


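LSD (Least Significant Difference) 

The adjusted p-values are the same as the original p-values: 

$$p_i^* = p_i$$

The adjusted critical value is: 

$$z^* = z_{1-\alpha/2}$$


Bonferroni 

The adjusted p-values are: 

$$p_i^* = R\,p_i$$

The adjusted critical value is: 

$$z^* = z_{1-\alpha/(2R)}$$


Sidak 

The adjusted p-values are: 

$$p_i^* = 1 - (1 - p_i)^R$$

The adjusted critical value is: 

$$z^* = z_{\left(1 + (1-\alpha)^{1/R}\right)/2}$$


Sequential Bonferroni 

Let $p_{(1)} \le \cdots \le p_{(R)}$ denote the ordered p-values. The adjusted p-values are: 

$$p_{(i)}^* = \begin{cases} R\,p_{(1)} & i = 1 \\ \max\left((R-i+1)\,p_{(i)},\ p_{(i-1)}^*\right) & i \ge 2 \end{cases}$$

The adjusted critical values will correspond to the ordered adjusted p-values as follows: 

$$z_{(i)}^* = \begin{cases} z_{1-\alpha/(2R)} & \text{if } i = 1 \\ z_{1-\alpha/(2(R-i+1))} & \text{if } p_{(i)}^* = (R-i+1)\,p_{(i)}, \text{ for } i \ge 2 \\ z_{(i-1)}^* & \text{if } p_{(i)}^* = p_{(i-1)}^*, \text{ for } i \ge 2 \end{cases}$$


Sequential Sidak 

The adjusted p-values are: 

$$p_{(i)}^* = \begin{cases} 1 - (1 - p_{(1)})^R & i = 1 \\ \max\left(1 - (1 - p_{(i)})^{R-i+1},\ p_{(i-1)}^*\right) & i \ge 2 \end{cases}$$

The adjusted critical values will correspond to the ordered adjusted p-values as follows: 

$$z_{(i)}^* = \begin{cases} z_{\left(1 + (1-\alpha)^{1/R}\right)/2} & \text{if } i = 1 \\ z_{\left(1 + (1-\alpha)^{1/(R-i+1)}\right)/2} & \text{if } p_{(i)}^* = 1 - (1 - p_{(i)})^{R-i+1}, \text{ for } i \ge 2 \\ z_{(i-1)}^* & \text{if } p_{(i)}^* = p_{(i-1)}^*, \text{ for } i \ge 2 \end{cases}$$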
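The five adjustments can be sketched in one small function. The name `adjust_p_values` and the method labels are hypothetical; the input is assumed to contain only the p-values of the R non-redundant row hypotheses. Adjusted critical values are not computed here.

```python
def adjust_p_values(p, method):
    """Adjust a list of p-values; results are returned in the input order."""
    R = len(p)
    if method == "lsd":
        adj = list(p)
    elif method == "bonferroni":
        adj = [R * pi for pi in p]
    elif method == "sidak":
        adj = [1.0 - (1.0 - pi) ** R for pi in p]
    elif method in ("seq_bonferroni", "seq_sidak"):
        order = sorted(range(R), key=lambda i: p[i])   # indices by ascending p
        adj = [0.0] * R
        prev = 0.0
        for rank, idx in enumerate(order, start=1):
            k = R - rank + 1                           # R - i + 1
            if method == "seq_bonferroni":
                raw = k * p[idx]
            else:
                raw = 1.0 - (1.0 - p[idx]) ** k
            prev = max(raw, prev)                      # enforce monotonicity
            adj[idx] = prev
    else:
        raise ValueError("unknown method: " + method)
    return [min(a, 1.0) for a in adj]                  # cap at 1
```

With p-values 0.01, 0.04 and 0.03, Bonferroni multiplies each by 3, while sequential Bonferroni multiplies the smallest by 3, the next by 2 and carries the running maximum forward to the largest.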


Comparison of Adjustment Methods 


A multiple testing procedure tells not only if $H_0$ is rejected, but also if each individual $H_{0i}$ is 
rejected. All the methods, except LSD, control the family-wise type I error for testing $H_0$; that is, 
the probability of rejecting at least one individual hypothesis under $H_0$. In addition, sequential 
methods also control the family-wise type I error for testing any subset of $H_0$. 


LSD is the one without any adjustment; it rejects $H_0$ too often. It does not control the family-wise 
type I error and should never be used to test $H_0$. It is provided here mainly for reference. 


Bonferroni is conservative in the sense that it rejects $H_0$ less often than it should. In some 
situations, it becomes extremely conservative when test statistics are highly correlated. 


Sidak is also conservative in most cases, but is less conservative than Bonferroni. It gives the 
exact type I error when test statistics are independent. 


Sequential Bonferroni is as conservative as the Bonferroni in terms of testing $H_0$ because the 
smallest adjusted p-value used in making the decision is the same in both methods. But in terms of 
testing individual $H_{0i}$, it is less conservative than the Bonferroni. Sequential Bonferroni rejects at 
least as many individual hypotheses as Bonferroni. 


Sequential Sidak is as conservative as the Sidak in terms of testing $H_0$, but less conservative 
than the Sidak in terms of testing individual $H_{0i}$. Sequential Sidak is less conservative than 
sequential Bonferroni. 


Scoring 


Scoring is defined as assigning one or more values to a case in a data set. Two types are considered 
here: predicted values and model diagnostics. 


Predicted Values 


Due to the non-linear link functions, the predicted values will be computed for the linear predictor 
and the mean of the response separately. Also, since estimated standard errors of predicted values 
of linear predictor are calculated, the confidence intervals for the mean are obtained easily. 


Predicted values are still computed as long as all the predictor variables have non-missing values 
in the given model. 


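Predicted Values of the Linear Predictors 

$$\hat{\eta}_i = x_i^T\hat{\beta} + o_i$$

For the ordinal multinomial model, a predicted value of the linear predictor for category j is 
given by 

$$\hat{\eta}_{i,j} = \hat{\psi}_j - x_i^T\hat{\beta} + o_i, \quad j = 1, \ldots, J-1.$$


Estimated Standard Errors of Predicted Values of the Linear Predictors 

$$\hat{\sigma}_{\eta_i} = \sqrt{x_i^T\Sigma x_i}$$

For the ordinal multinomial model, the estimated standard error of $\hat{\eta}_{i,j}$ is given by 

$$\hat{\sigma}_{\eta_{i,j}} = \sqrt{(1, -x_i^T)\,\Sigma_j\,(1, -x_i^T)^T}, \quad j = 1, \ldots, J-1,$$

where $\Sigma_j$ is a reduced parameter estimates covariance (1 + p)×(1 + p) matrix from $\Sigma$. Suppose 
$\Sigma$ for ordinal multinomial models has the following form: 

$$\Sigma = \begin{pmatrix} \Sigma_{\psi,\psi} & \Sigma_{\psi,\beta} \\ \Sigma_{\beta,\psi} & \Sigma_{\beta,\beta} \end{pmatrix} = \begin{pmatrix} \sigma_{1,1} & \cdots & \sigma_{1,(J-1)} & \sigma_{1,J} & \cdots & \sigma_{1,(J-1+p)} \\ \vdots & & \vdots & \vdots & & \vdots \\ \sigma_{(J-1),1} & \cdots & \sigma_{(J-1),(J-1)} & \sigma_{(J-1),J} & \cdots & \sigma_{(J-1),(J-1+p)} \\ \sigma_{J,1} & \cdots & \sigma_{J,(J-1)} & \sigma_{J,J} & \cdots & \sigma_{J,(J-1+p)} \\ \vdots & & \vdots & \vdots & & \vdots \\ \sigma_{(J-1+p),1} & \cdots & \sigma_{(J-1+p),(J-1)} & \sigma_{(J-1+p),J} & \cdots & \sigma_{(J-1+p),(J-1+p)} \end{pmatrix}$$

then $\Sigma_j$ will have the following form, as it takes the corresponding elements in the jth row and 
column of $\Sigma$ and $\Sigma_{\beta,\beta}$: 

$$\Sigma_j = \begin{pmatrix} \sigma_{j,j} & \left[\sigma_{j,J}\ \cdots\ \sigma_{j,(J-1+p)}\right] \\ \left[\sigma_{J,j}\ \cdots\ \sigma_{(J-1+p),j}\right]^T & \Sigma_{\beta,\beta} \end{pmatrix} = \begin{pmatrix} \sigma_{j,j} & \sigma_{j,J} & \cdots & \sigma_{j,(J-1+p)} \\ \sigma_{J,j} & \sigma_{J,J} & \cdots & \sigma_{J,(J-1+p)} \\ \vdots & \vdots & & \vdots \\ \sigma_{(J-1+p),j} & \sigma_{(J-1+p),J} & \cdots & \sigma_{(J-1+p),(J-1+p)} \end{pmatrix}$$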


Predicted Values of the Means 


[hi =9"! (af é | 0:) 


where g_! is the inverse of the link function. For binomial response with 0/1 binary response 
variable, this the predicted probability of category 1. 


For the ordinal multinomial model, a predicted value of the cumulative response probability 
for category j is given by 


Vij = gy _ xl t 01) » Sheed Sl 
Confidence Intervals for the Means 


Approximate 100(1—a)% confidence intervals for the mean can be computed as follows 


a («I 3 +0; +24~0/25n) 


at 


Approximate 100(1—a)% confidence intervals for the cumulative response probability can be 
computed as follows 
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_— ; T ‘ . 
Gq (jx; B + 0; + 21~a/2%n, ) a acre Weer re esl © 


If either endpoint in the argument is outside the valid range for the inverse link function, the 
corresponding confidence interval endpoint is set to a system missing value. 
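
The back-transformation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `mean_ci` and the `in_domain` callback are hypothetical, and `None` stands in for the system missing value.

```python
import math
from statistics import NormalDist

def mean_ci(eta, se_eta, inv_link, alpha=0.05, in_domain=lambda a: True):
    """Back-transform eta -/+ z * se_eta through the inverse link.

    An endpoint whose argument fails `in_domain` is returned as None,
    mirroring the "system missing" rule (hypothetical representation).
    """
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # z_{1 - alpha/2}
    ends = []
    for arg in (eta - z * se_eta, eta + z * se_eta):
        ends.append(inv_link(arg) if in_domain(arg) else None)
    if None not in ends:
        ends.sort()   # a decreasing inverse link would swap the endpoints
    return tuple(ends)

# Log link: the inverse link is exp, valid on the whole real line.
lo, hi = mean_ci(1.0, 0.1, math.exp)
```

For a link whose inverse is only defined on part of the real line, pass a matching `in_domain` test, for example `lambda a: a >= 0` for a square-root inverse.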


Predicted category for binomial and multinomial distributions 


For the binomial distribution with 0/1 binary response variable, the predicted category is 


\[
c(x_i) =
\begin{cases}
1 \text{ (or success)} & \text{if } \hat\mu_i \ge 0.5 \\
0 \text{ (or failure)} & \text{otherwise}
\end{cases}
\]


For the ordinal multinomial model, the predicted category is the one with the highest predicted 
probability; that is 


\[ c(x_i) = \arg\max_j \hat\pi_{i,j}. \]

■ If there are ties in determining $c(x_i)$, choose the category with the highest $N_j = \sum_{i=1}^{n} f_i y_{i,j}$.
■ If there are still ties, choose the one with the lowest category number.
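
A small sketch of the selection and tie-breaking rules follows. It assumes the category totals $N_j$ have already been computed and are passed in as `category_counts` (a hypothetical argument name), and it treats two probabilities as tied only when they are exactly equal.

```python
def predicted_category(probs, category_counts):
    """Return the 1-based predicted category.

    probs[j] is the predicted probability of category j + 1 and
    category_counts[j] is the tie-breaking total for that category.
    """
    best = max(probs)
    tied = [j for j, p in enumerate(probs) if p == best]
    # First tie-break: prefer the category with the highest count.
    top = max(category_counts[j] for j in tied)
    tied = [j for j in tied if category_counts[j] == top]
    # Second tie-break: the lowest category number wins.
    return tied[0] + 1
```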


Diagnostics 


In addition to predicted values, we can calculate some values which would be good for model 
diagnostics for all distributions except the ordinal multinomial. 


Leverage 


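The leverage value $h_i$ is defined as the ith diagonal element of the hat matrix

\[ H = W_e^{1/2} X \left(X^T W_e X\right)^{-1} X^T W_e^{1/2}, \]

where the ith diagonal element of $W_e$ is

\[ w_{ei} = \frac{\omega_i}{\phi} \cdot \frac{1}{V(\mu_i)\,\left(g'(\mu_i)\right)^2}. \]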


Raw Residuals 


\[ r_i^R = y_i - \hat\mu_i \]


where $y_i$ is the ith response and $\hat\mu_i$ is the corresponding predicted mean. Note that for a binomial
response with a binary format, y values are 0 for the reference category and 1 for the category
we are modeling.


Pearson Residuals

The Pearson residual is the square root of the ith contribution to the Pearson chi-square.


\[ r_i^P = (y_i - \hat\mu_i) \sqrt{\frac{\omega_i}{V(\hat\mu_i)}} \]

Deviance Residuals 


The deviance residual is the square root of the contribution of the ith observation to the deviance, 
with the sign of the raw residual. 


\[ r_i^D = \operatorname{sign}(y_i - \hat\mu_i) \sqrt{d_i} \]


where $d_i$ is the contribution of the ith case to the deviance and sign() is 1 if its argument is positive
and -1 if it is negative.

Standardized (and Studentized) Pearson Residuals 


\[ r_i^{SP} = (y_i - \hat\mu_i) \sqrt{\frac{\omega_i}{\phi\, V(\hat\mu_i)\,(1 - h_i)}} = \frac{r_i^P}{\sqrt{\phi (1 - h_i)}} \]


Standardized (and Studentized) Deviance Residuals 


\[ r_i^{SD} = \operatorname{sign}(y_i - \hat\mu_i) \sqrt{\frac{d_i}{\phi (1 - h_i)}} = \frac{r_i^D}{\sqrt{\phi (1 - h_i)}} \]


Likelihood Residuals 


\[ r_i^L = \operatorname{sign}(y_i - \hat\mu_i) \sqrt{h_i \left(r_i^{SP}\right)^2 + (1 - h_i) \left(r_i^{SD}\right)^2} \]

Cook’s Distance 


\[ C_i = \frac{h_i}{p_x (1 - h_i)} \left(r_i^{SP}\right)^2 \]
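
To make the diagnostics concrete, here is a minimal sketch for one special case: a Poisson model with a log link, an intercept, and a single covariate (so $p_x = 2$). The function name `poisson_diagnostics` is hypothetical, the coefficients are taken as given rather than estimated, and Cook's distance is computed as $h_i (r_i^{SP})^2 / (p_x (1 - h_i))$, which is an assumed reading of the definition.

```python
import math

def poisson_diagnostics(x, y, beta, phi=1.0, omega=None):
    """Leverage, Pearson, standardized Pearson and Cook's distance per case."""
    n = len(y)
    omega = omega or [1.0] * n
    rows = [(1.0, xi) for xi in x]                       # design rows (intercept, x)
    mu = [math.exp(beta[0] + beta[1] * xi) for xi in x]  # fitted means
    # Poisson with log link: V(mu) = mu and g'(mu) = 1/mu, so w_e = omega * mu / phi.
    w = [om * m / phi for om, m in zip(omega, mu)]
    # X^T W_e X for the two-column design, inverted in closed form.
    a = sum(wi * r[0] * r[0] for wi, r in zip(w, rows))
    b = sum(wi * r[0] * r[1] for wi, r in zip(w, rows))
    d = sum(wi * r[1] * r[1] for wi, r in zip(w, rows))
    det = a * d - b * b
    inv = ((d / det, -b / det), (-b / det, a / det))
    out = []
    for wi, r, yi, mi, om in zip(w, rows, y, mu, omega):
        quad = sum(r[j] * inv[j][k] * r[k] for j in range(2) for k in range(2))
        h = wi * quad                                    # ith diagonal of the hat matrix
        r_p = (yi - mi) * math.sqrt(om / mi)             # Pearson residual
        r_sp = r_p / math.sqrt(phi * (1.0 - h))          # standardized Pearson residual
        cook = h / (2 * (1.0 - h)) * r_sp ** 2           # p_x = 2 in this sketch
        out.append({"leverage": h, "pearson": r_p, "std_pearson": r_sp, "cook": cook})
    return out

diag = poisson_diagnostics([0, 1, 2, 3, 4], [1, 2, 4, 7, 12], (0.1, 0.6))
```

A quick sanity check on any implementation of the hat matrix is that the leverages sum to the number of non-redundant columns of the design matrix.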


Generalized Estimating Equations 


Generalized estimating equations (GEE) extend the GZLM algorithm to accommodate correlated 
data. The algorithms of generalized estimating equations are based on Liang and Zeger (1986) 
and Diggle, Heagerty, Liang and Zeger (2002). 


Data Format 


The data format in GEE is very different from that in GZLM, so the data used in GEE need to
be formatted appropriately. The structure of the correlated data has two dimensions: there are
some independent subjects (the subject effect), and each subject has correlated measurements
(the within-subject effect).


The subject effect can be a single variable or a list of variables connected by asterisks (*). In
general, the number of subjects equals the number of distinct combinations of values of the
variables, except under some circumstances (see the example below).

The within-subject effect defines the ordering of measurements within subjects. If specified, it
can be a single variable or a list of variables connected by asterisks (*). The start and end of
the within-subject effect could be different for each subject, so the whole data file is checked
to find the complete set of measurements, which includes all distinct combinations of values of the
within-subject effect from all subjects. The dimension of the complete set of measurements will be
the dimension of the working correlation matrix (see “Model” for more information). If some
measurements do not appear in the data for some subjects, then the existing measurements are
ordered and the omitted measurements are treated as missing values.


Note that the within-subject effect might not be equally spaced. This is relevant for the time 
dependent working correlation structures. We will assume that the lags based on the data ordered 
by the within-subject effect are appropriate and fit the model. 


The data have to be properly grouped by the subject effect and sorted by the within-subject effect 
if it exists. If you specify not to sort the data file (SORT=NO), we assume that the cases in the data 
file are pre-sorted. If you specify to sort the data file (SORT=YES), the data will be sorted internally 
by the subject effect and the within-subject effect, if the within-subject effect is specified. 


Consider the following artificial data: 


Table 46-19 
Example data 


center  id  year  y  x1
A       11  91    4  0
A       11  93    5  1
A       12  93    5  1
A       11  94    6  1
A       12  94    6  0
A       12  95    7  1
B       1   91    6  0
B       1   94    3  0
B       2   93    5  1
B       2   95    7  0
B       2   94    8  1


Suppose the subject effect is specified as center*id. The number of subjects or clusters depends on 
whether the within-subject effect is specified or not and whether the data are indicated to be sorted 
or not. Thus we consider the following cases: 


Within-subject effect is specified, data will be sorted by procedure (SORT=YES) 


There are four distinct combinations for the subject effect: (center*id) = (A*11), (A*12), (B*1),
(B*2). The data will be grouped internally based on them, so the number of clusters or groups =
4. The complete set of measurements = (91, 93, 94, 95), so its dimension = 4; the maximum
and minimum sizes of the within-subject effect are 3 and 2, respectively. Note that the measurements
for the within-subject effect are not equally spaced; we assume the measurements are spaced
appropriately when calculating the time dependent working correlation structures.

Figure 46-1
GEE model information about the data
[Output table: Number of Levels (Subject Effect, Within-Subject Effect); Number of Subjects; Number of Measurements per Subject (Minimum, Maximum); Correlation Matrix Dimension]


The data file is then organized internally as follows (subject and withinsubject are internal 
variables): 


Table 46-20
Data file structure

center  id  year  y  x1  subject  withinsubject
A       11  91    4  0   1        1
A       11  93    5  1   1        2
A       11  94    6  1   1        3
A       11  95    .  .   1        4
A       12  91    .  .   2        1
A       12  93    5  1   2        2
A       12  94    6  0   2        3
A       12  95    7  1   2        4
B       1   91    6  0   3        1
B       1   93    .  .   3        2
B       1   94    3  0   3        3
B       1   95    .  .   3        4
B       2   91    .  .   4        1
B       2   93    5  1   4        2
B       2   94    8  1   4        3
B       2   95    7  0   4        4


Within-subject effect is not specified, data will be sorted by procedure (SORT=YES) 


There are still 4 distinct combinations for the subject effect, so the number of clusters or groups =
4. The dimension of the working correlation matrix is 3, which is determined by the maximum number
of measurements over all subjects; the maximum and minimum sizes of repeated measurements
are 3 and 2, respectively. A summary is as follows:


Figure 46-2
GEE model information about the data
[Output table: Number of Levels (Subject Effect); Number of Subjects; Number of Measurements per Subject (Minimum, Maximum); Correlation Matrix Dimension]

The data file is then organized internally as follows (subject and withinsubject are internal
variables):

Table 46-21
Data file structure

center  id  year  y  x1  subject  withinsubject
A       11  91    4  0   1        1
A       11  93    5  1   1        2
A       11  94    6  1   1        3
A       12  93    5  1   2        1
A       12  94    6  0   2        2
A       12  95    7  1   2        3
B       1   91    6  0   3        1
B       1   94    3  0   3        2
B       1   .     .  .   3        3
B       2   93    5  1   4        1
B       2   95    7  0   4        2
B       2   94    8  1   4        3


Data will not be sorted by procedure (SORT=NO) 
When data are not to be sorted, the within-subject effect will be ignored whether specified or not. 


From the original data file, we notice that the same combinations of values for the subject effect 
are in different blocks, so they will be considered as different clusters. For example: 


The 1st cluster (center*id = A*11) includes the 1st and 2nd observations.

The 2nd cluster (center*id = A*12) includes the 3rd observation. 

The 3rd cluster (center*id = A*11) includes the 4th observation. 

The 4th cluster (center*id = A*12) includes the 5th and 6th observations. 

The 5th cluster (center*id = B*1) includes the 7th and 8th observations. 

The 6th cluster (center*id = B*2) includes the 9th, 10th and 11th observations. 


So the number of clusters = 6. The dimension of the working correlation matrix is 3; the maximum
and minimum sizes of repeated measurements are 3 and 1, respectively. A summary is as follows:


Figure 46-3
GEE model information about the data
[Output table: Number of Levels (Subject Effect); Number of Subjects; Number of Measurements per Subject (Minimum, Maximum); Correlation Matrix Dimension]

The data file is then organized internally as follows (subject and withinsubject are internal
variables):

Table 46-22
Data file structure

center  id  year  y  x1  subject  withinsubject
A       11  91    4  0   1        1
A       11  93    5  1   1        2
A       11  .     .  .   1        3
A       12  93    5  1   2        1
A       12  .     .  .   2        2
A       12  .     .  .   2        3
A       11  94    6  1   3        1
A       11  .     .  .   3        2
A       11  .     .  .   3        3
A       12  94    6  0   4        1
A       12  95    7  1   4        2
A       12  .     .  .   4        3
B       1   91    6  0   5        1
B       1   94    3  0   5        2
B       1   .     .  .   5        3
B       2   93    5  1   6        1
B       2   95    7  0   6        2
B       2   94    8  1   6        3

After reformatting the data, we assume there are $i = 1, \ldots, K$ subjects or clusters, where each
subject or cluster has $t = 1, \ldots, n_i$ correlated measurements. Note that now $n_1 = n_2 = \cdots = n_K$. The
following notation applies to the reformatted data, not the original data.
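
The cluster counts in the three cases above can be reproduced with a short sketch. The function name `cluster_sizes` is hypothetical; the rows are the (center, id, year) values of the example data in their original file order.

```python
# (center, id, year) for the eleven example cases, in file order.
DATA = [("A", 11, 91), ("A", 11, 93), ("A", 12, 93), ("A", 11, 94),
        ("A", 12, 94), ("A", 12, 95), ("B", 1, 91), ("B", 1, 94),
        ("B", 2, 93), ("B", 2, 95), ("B", 2, 94)]

def cluster_sizes(rows, sort):
    """Number of observed measurements per cluster for subject effect center*id."""
    keys = [(center, ident) for center, ident, _ in rows]
    if sort:
        # SORT=YES: every distinct combination of values is one cluster.
        sizes = {}
        for key in keys:
            sizes[key] = sizes.get(key, 0) + 1
        return list(sizes.values())
    # SORT=NO: each run of consecutive identical keys is its own cluster.
    sizes, prev = [], object()
    for key in keys:
        if key != prev:
            sizes.append(0)
            prev = key
        sizes[-1] += 1
    return sizes

sorted_sizes = cluster_sizes(DATA, sort=True)
unsorted_sizes = cluster_sizes(DATA, sort=False)
# Complete set of measurements when the within-subject effect (year) is specified.
complete_set = sorted({row[2] for row in DATA})
```

The number of clusters is the length of the returned list, and its minimum and maximum give the reported sizes of repeated measurements.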


Notation 


The following notation is used throughout this section unless otherwise stated: 


Table 46-23
Notation

Notation  Description
$K$       Number of subjects (clusters or groups) in the data set. It is an integer and $K \ge 1$.
$n_i$     Number of complete measurements on the ith subject. It is an integer and $n_i \ge 1$.
$n$       Total number of measurements; $n = \sum_{i=1}^{K} n_i$. It is an integer and $n \ge 1$.
$p$       Number of parameters (including the intercept, if it exists) in the model. It is an integer and $p \ge 1$.
$p_x$     Number of non-redundant columns in the design matrix. It is an integer and $p_x \ge 1$.
$y$       $n \times 1$ dependent variable vector. $y = (y_1^T, \ldots, y_K^T)^T$ with $y_i = (y_{i1}, \ldots, y_{in_i})^T$ for each i.
$r$       $n \times 1$ vector of events for the binomial distribution; it usually represents the number of “successes”. All elements are non-negative integers.
$m$       $n \times 1$ vector of trials for the binomial distribution. All elements are positive integers and $m_i \ge r_i$, $i = 1, \ldots, n$.
$\mu$     $n \times 1$ vector of expectations of the dependent variable.
$\eta$    $n \times 1$ vector of linear predictors.
$X$       $n \times p$ design matrix. The vector for the tth measurement on the ith subject is $x_{it} = (x_{it1}, \ldots, x_{itp})^T$, $i = 1, \ldots, K$ and $t = 1, \ldots, n_i$, with $x_{it1} = 1$ if the model has an intercept.
$\beta$   $p \times 1$ vector of unknown parameters. The first element in $\beta$ is the intercept, if there is one.
$\omega$  $n \times 1$ vector of scale weights. If an element is less than or equal to 0 or missing, the corresponding case is not used.
$f$       $n \times 1$ vector of frequency counts. Non-integer elements are treated by rounding the value to the nearest integer. For values less than 0.5 or missing, the corresponding cases are not used.
$N$       Effective sample size. $N = \sum_{i=1}^{K} f_i n_i$.


Model


GEE offers the same link functions and distributions as those for GZLM. The generalized 
estimating equation is given by 


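\[ s(\beta) = \sum_{i=1}^{K} \left(\frac{\partial \mu_i}{\partial \beta}\right)^T V_i^{-} (y_i - \mu_i) = 0_{p \times 1} \]


where

\[
\frac{\partial \mu_i}{\partial \beta}
= \operatorname{diag}\left(\frac{1}{g'(\mu_{it})}\right) X_i
= \begin{bmatrix}
\dfrac{x_{i11}}{g'(\mu_{i1})} & \cdots & \dfrac{x_{i1p}}{g'(\mu_{i1})} \\
\vdots & & \vdots \\
\dfrac{x_{in_i1}}{g'(\mu_{in_i})} & \cdots & \dfrac{x_{in_ip}}{g'(\mu_{in_i})}
\end{bmatrix}_{n_i \times p}
\]

$X_i$ is the $n_i \times p$ matrix whose rows are $x_{it}^T$, $\operatorname{diag}(1/g'(\mu_{it}))$, $t = 1, \ldots, n_i$, is an $n_i \times n_i$ matrix, $V_i$ is the assumed covariance matrix of $y_i$,
and $V_i^{-}$ is a generalized inverse of $V_i$.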


If the measurements within the ith subject are independent like in GZLM, then $V_i$ is a diagonal
matrix which can be decomposed into

\[ V_i = \left(\phi A_i^{1/2} I_{n_i} A_i^{1/2}\right)_{n_i \times n_i} \]

where $A_i = \operatorname{diag}(V(\mu_{it})/\omega_{it})$, $t = 1, \ldots, n_i$, is an $n_i \times n_i$ matrix and $I_{n_i}$ is an $n_i \times n_i$ identity
matrix. However, if the measurements within the ith subject are correlated, we simply replace the
identity matrix $I_{n_i}$ with a more general correlation matrix $R(\alpha)$:

\[ V_i = \phi A_i^{1/2} R(\alpha) A_i^{1/2} \]

where $R(\alpha)$ is an $n_i \times n_i$ “working” correlation matrix which can be estimated through the
parameter vector $\alpha$. Since $R(\alpha)$ usually doesn’t have a diagonal form, neither does $V_i$. If $R(\alpha)$ is
indeed the true correlation matrix of the $y_i$’s, then $V_i$ is the true covariance matrix of $y_i$.
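
As an illustration, the decomposition $V_i = \phi A_i^{1/2} R(\alpha) A_i^{1/2}$ can be assembled element by element. The function names below are hypothetical, and the exchangeable pattern is used only as an example of a working correlation matrix.

```python
import math

def exchangeable(n, alpha):
    """n x n correlation matrix with 1 on the diagonal and alpha elsewhere."""
    return [[1.0 if u == v else alpha for v in range(n)] for u in range(n)]

def working_covariance(variances, weights, corr, phi=1.0):
    """V_i = phi * A^(1/2) R A^(1/2) with A = diag(V(mu_it) / omega_it)."""
    a = [v / w for v, w in zip(variances, weights)]   # diagonal of A_i
    n = len(a)
    return [[phi * math.sqrt(a[u]) * corr[u][v] * math.sqrt(a[v])
             for v in range(n)] for u in range(n)]

# Three measurements with variance-function values 4, 9, 1 and unit scale weights.
V = working_covariance([4.0, 9.0, 1.0], [1.0, 1.0, 1.0], exchangeable(3, 0.5), phi=2.0)
```

With the identity matrix in place of `exchangeable(...)`, the result reduces to the diagonal matrix of the independence case.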


Ordinal Multinomial Model 


For ordinal multinomial GEE models, we need to transform original response variable and define 
some notation as follows: 


Table 46-24
Notation

Notation        Description
$J$             Number of values for the ordinal response. It is an integer and $J \ge 2$.
$y$             $n \times 1$ dependent variable vector. $y = (y_1^T, \ldots, y_K^T)^T$ with $y_i = (y_{i1}, \ldots, y_{in_i})^T$ for each i.
$z$             $(K \times n_i \times (J-1)) \times 1$ transformed dependent variable vector. $z = (z_1^T, \ldots, z_K^T)^T$, $z_i = (z_{i1}^T, \ldots, z_{in_i}^T)^T$, $z_{it} = (y_{it,1}, \ldots, y_{it,J-1})^T$, and $y_{it,j} = 1$ if $y_{it} = j$, 0 otherwise.
$\pi$           $(K \times n_i \times (J-1)) \times 1$ conditional response probability vector. $\pi = (\pi_1^T, \ldots, \pi_K^T)^T$, $\pi_i = (\pi_{i1}^T, \ldots, \pi_{in_i}^T)^T$, and $\pi_{it} = (\pi_{it,1}, \ldots, \pi_{it,J-1})^T$, where $\pi_{it,j}$ is the conditional response probability of measurement t on subject i for category j given the observed independent variable vector; that is, $\pi_{it,j} = P(y_{it} = j \mid x_{it})$ and $\pi_{it,j} = \gamma_{it,j} - \gamma_{it,j-1}$ for $j = 1, \ldots, J$.
$\gamma_{it,j}$ Conditional cumulative response probability of measurement t on subject i for category j given the observed independent variable vector; that is, $\gamma_{it,j} = P(y_{it} \le j \mid x_{it})$.
$\eta_{it,j}$   Linear predictor value of measurement t on subject i for category j. It is related to $\gamma_{it,j}$ through a cumulative link function.
$\psi$          $(J-1) \times 1$ vector of threshold parameters; $\psi = (\psi_1, \psi_2, \ldots, \psi_{J-1})^T$ and $\psi_1 \le \psi_2 \le \cdots \le \psi_{J-1}$.
$\beta$         $p \times 1$ vector of regression parameters associated with model predictors; $\beta = (\beta_1, \beta_2, \ldots, \beta_p)^T$.
$B$             $(J-1+p) \times 1$ vector of all parameters; $B = (\psi^T, \beta^T)^T$.


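The generalized estimating equation for estimating parameters B is given by

\[ s(B) = \sum_{i=1}^{K} \left(\frac{\partial \pi_i}{\partial B}\right)^T V_i^{-} (z_i - \pi_i) = 0_{(J-1+p) \times 1} \]

where

\[
\frac{\partial \pi_i}{\partial B} =
\begin{bmatrix}
\dfrac{\partial \pi_{i1,1}}{\partial \psi_1} & \cdots & \dfrac{\partial \pi_{i1,1}}{\partial \psi_{J-1}} & \dfrac{\partial \pi_{i1,1}}{\partial \beta_1} & \cdots & \dfrac{\partial \pi_{i1,1}}{\partial \beta_p} \\
\vdots & & \vdots & \vdots & & \vdots \\
\dfrac{\partial \pi_{i1,J-1}}{\partial \psi_1} & \cdots & \dfrac{\partial \pi_{i1,J-1}}{\partial \psi_{J-1}} & \dfrac{\partial \pi_{i1,J-1}}{\partial \beta_1} & \cdots & \dfrac{\partial \pi_{i1,J-1}}{\partial \beta_p} \\
\vdots & & \vdots & \vdots & & \vdots \\
\dfrac{\partial \pi_{in_i,1}}{\partial \psi_1} & \cdots & \dfrac{\partial \pi_{in_i,1}}{\partial \psi_{J-1}} & \dfrac{\partial \pi_{in_i,1}}{\partial \beta_1} & \cdots & \dfrac{\partial \pi_{in_i,1}}{\partial \beta_p} \\
\vdots & & \vdots & \vdots & & \vdots \\
\dfrac{\partial \pi_{in_i,J-1}}{\partial \psi_1} & \cdots & \dfrac{\partial \pi_{in_i,J-1}}{\partial \psi_{J-1}} & \dfrac{\partial \pi_{in_i,J-1}}{\partial \beta_1} & \cdots & \dfrac{\partial \pi_{in_i,J-1}}{\partial \beta_p}
\end{bmatrix}_{(n_i(J-1)) \times (J-1+p)}
\]

and for all $t = 1, \ldots, n_i$,

\[ \frac{\partial \pi_{it,j}}{\partial \psi_j} = \frac{\partial \gamma_{it,j}}{\partial \eta_{it,j}}, \quad j = 1, \ldots, J-1, \]

\[ \frac{\partial \pi_{it,j}}{\partial \psi_{j-1}} = -\frac{\partial \gamma_{it,j-1}}{\partial \eta_{it,j-1}}, \quad j = 2, \ldots, J-1, \]

\[ \frac{\partial \pi_{it,j}}{\partial \psi_l} = 0 \quad \text{for } j < l \text{ or } j - l > 1, \; j = 1, \ldots, J-1 \text{ and } l = 1, \ldots, J-1, \]

\[ \frac{\partial \pi_{it,1}}{\partial \beta_l} = -\frac{\partial \gamma_{it,1}}{\partial \eta_{it,1}}\, x_{itl}, \quad l = 1, \ldots, p, \]

\[ \frac{\partial \pi_{it,j}}{\partial \beta_l} = -\left(\frac{\partial \gamma_{it,j}}{\partial \eta_{it,j}} - \frac{\partial \gamma_{it,j-1}}{\partial \eta_{it,j-1}}\right) x_{itl}, \quad j = 2, \ldots, J-1 \text{ and } l = 1, \ldots, p, \]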


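and $V_i^{-}$ is a generalized inverse of $V_i$. Here

\[ V_i = A_i^{1/2} R_i(\alpha) A_i^{1/2}, \]

where

\[ A_i = \operatorname{diag}(A_{i1}, \ldots, A_{in_i}), \]

\[ A_{it} = \frac{1}{\omega_{it}} \operatorname{diag}\left(\pi_{it,1}(1 - \pi_{it,1}), \ldots, \pi_{it,J-1}(1 - \pi_{it,J-1})\right), \]

and

\[
R_i(\alpha) =
\begin{bmatrix}
A_{i1}^{-1/2} V_{i1} A_{i1}^{-1/2} & \rho_{12} & \cdots & \rho_{1n_i} \\
\rho_{21} & A_{i2}^{-1/2} V_{i2} A_{i2}^{-1/2} & \cdots & \rho_{2n_i} \\
\vdots & \vdots & \ddots & \vdots \\
\rho_{n_i1} & \rho_{n_i2} & \cdots & A_{in_i}^{-1/2} V_{in_i} A_{in_i}^{-1/2}
\end{bmatrix}.
\]

Note that there is a subscript i in $R_i(\alpha)$, which means each subject has a different working
correlation matrix. In fact, only the diagonal blocks differ between subjects;
the off-diagonal blocks are the same for all subjects. The diagonal blocks of $R_i(\alpha)$,
$A_{it}^{-1/2} V_{it} A_{it}^{-1/2}$, with

\[ A_{it}^{-1/2} = \omega_{it}^{1/2} \times \operatorname{diag}\left(\{\pi_{it,1}(1 - \pi_{it,1})\}^{-1/2}, \ldots, \{\pi_{it,J-1}(1 - \pi_{it,J-1})\}^{-1/2}\right) \]

and

\[
V_{it} = \frac{1}{\omega_{it}} \left(\operatorname{diag}(\pi_{it,1}, \ldots, \pi_{it,J-1}) - \pi_{it} \pi_{it}^T\right)
= \frac{1}{\omega_{it}}
\begin{bmatrix}
\pi_{it,1}(1 - \pi_{it,1}) & -\pi_{it,1}\pi_{it,2} & \cdots & -\pi_{it,1}\pi_{it,J-1} \\
-\pi_{it,2}\pi_{it,1} & \pi_{it,2}(1 - \pi_{it,2}) & \cdots & -\pi_{it,2}\pi_{it,J-1} \\
\vdots & \vdots & \ddots & \vdots \\
-\pi_{it,J-1}\pi_{it,1} & -\pi_{it,J-1}\pi_{it,2} & \cdots & \pi_{it,J-1}(1 - \pi_{it,J-1})
\end{bmatrix}
\]

are specified entirely by $\pi_{it}$. In particular, the diagonal elements of $A_{it}^{-1/2} V_{it} A_{it}^{-1/2}$ are 1 and the
off-diagonal $(j, l)$ elements are

\[ \frac{-\pi_{it,j}\pi_{it,l}}{\{\pi_{it,j}(1 - \pi_{it,j})\, \pi_{it,l}(1 - \pi_{it,l})\}^{1/2}} \]

which are not constant and depend on the categories j and l at measurement t. The unknown
off-diagonal blocks of $R_i(\alpha)$ are the $(J-1) \times (J-1)$ matrices $\rho_{uv}(\alpha)$, $u, v = 1, \ldots, n_i$, $u \ne v$, which
we need to parameterize and estimate.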


Working correlation matrix 


The working correlation matrix is usually unknown and should be estimated. We use the 
estimated Pearson residuals 


\[ r_{it} = (y_{it} - \hat\mu_{it}) \sqrt{\frac{\omega_{it}}{V(\hat\mu_{it})}} \]


from the current fit of the model to estimate a. 


For the ordinal multinomial model, we define estimated Pearson-like residuals as follows:

\[ r_{it,j} = (y_{it,j} - \hat\pi_{it,j}) \sqrt{\frac{\omega_{it}}{\hat\pi_{it,j}(1 - \hat\pi_{it,j})}} \]

and the vector

\[ r_{it} = (r_{it,1}, \ldots, r_{it,J-1})^T. \]


The following structures are available.

Independent

The independent correlation structure is defined as:

\[ R_{uv} = \begin{cases} 1 & \text{if } u = v \\ 0 & \text{otherwise} \end{cases} \]

For the ordinal multinomial model:

\[ \rho_{uv} = 0, \quad u, v = 1, \ldots, n_i \text{ and } u \ne v. \]

No parameters need to be estimated for this structure.


Exchangeable 


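The exchangeable correlation structure is defined as:

\[ R_{uv} = \begin{cases} 1 & \text{if } u = v \\ \alpha & \text{otherwise} \end{cases} \]

1 parameter is estimated as follows:

\[
\hat\alpha =
\frac{\dfrac{1}{\left(\frac{1}{2}\sum_{i=1}^{K} f_i n'_i (n'_i - 1)\right) - p_x} \sum_{i=1}^{K} f_i \sum_{t > t'} r_{it} r_{it'}}
{\dfrac{1}{N' - p_x} \sum_{i=1}^{K} f_i \sum_{t=1}^{n_i} r_{it}^2}
\]

where $N' = \sum_{i=1}^{K} f_i n'_i$ and $n'_i$ is the number of non-missing measurements on the ith subject.

For the ordinal multinomial model:

\[
\hat\alpha =
\frac{\sum_{i=1}^{K} f_i \sum_{t > t'} \frac{1}{2}\left(r_{it} r_{it'}^T + r_{it'} r_{it}^T\right)}
{\left(\frac{1}{2}\sum_{i=1}^{K} f_i n'_i (n'_i - 1)\right) - (J - 1 + p_x)}
\]

and $\rho_{uv} = \hat\alpha$, $u, v = 1, \ldots, n_i$ and $u \ne v$.
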
AR(1) 


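The first-order autoregressive correlation structure is defined as:

\[ R_{uv} = \begin{cases} 1 & \text{if } u = v \\ \alpha^{|u - v|} & \text{otherwise} \end{cases} \]

1 parameter is estimated as follows:

\[
\hat\alpha =
\frac{\dfrac{1}{\left(\sum_{i=1}^{K} f_i n''_i\right) - p_x} \sum_{i=1}^{K} f_i \sum_{t=1}^{n_i - 1} r_{it} r_{i,t+1}}
{\dfrac{1}{N' - p_x} \sum_{i=1}^{K} f_i \sum_{t=1}^{n_i} r_{it}^2}
\]

where $n''_i$ is the number of non-missing pairs used in the numerator part for the ith subject. If
there is no missing measurement for the ith subject, $n''_i = n_i - 1$.

For the ordinal multinomial model:

\[
\hat\alpha =
\frac{\sum_{i=1}^{K} f_i \sum_{t=1}^{n_i - 1} \frac{1}{2}\left(r_{it} r_{i,t+1}^T + r_{i,t+1} r_{it}^T\right)}
{\left(\sum_{i=1}^{K} f_i n''_i\right) - (J - 1 + p_x)}
\]

and $\rho_{uv} = \hat\alpha^{|u - v|}$, $u, v = 1, \ldots, n_i$ and $u \ne v$.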


M-dependent 


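The m-dependent correlation structure is defined as:

\[
R_{uv} = \begin{cases}
1 & \text{if } u = v \\
\alpha_{|u - v|} & \text{if } 0 < |u - v| \le m \\
0 & \text{otherwise}
\end{cases}
\]

m parameters are estimated as follows:

\[
\hat\alpha_j =
\frac{\dfrac{1}{\left(\sum_{i=1}^{K} f_i n''_{ij}\right) - p_x} \sum_{i=1}^{K} f_i \sum_{t=1}^{n_i - j} r_{it} r_{i,t+j}}
{\dfrac{1}{N' - p_x} \sum_{i=1}^{K} f_i \sum_{t=1}^{n_i} r_{it}^2},
\quad j = 1, \ldots, m
\]

where $n''_{ij}$ is the number of non-missing pairs for the ith subject in calculating $\alpha_j$. If there is no
missing measurement for the ith subject, $n''_{ij} = n_i - j$.

For the ordinal multinomial model:

\[
\hat\alpha_j =
\frac{\sum_{i=1}^{K} f_i \sum_{t=1}^{n_i - j} \frac{1}{2}\left(r_{it} r_{i,t+j}^T + r_{i,t+j} r_{it}^T\right)}
{\left(\sum_{i=1}^{K} f_i n''_{ij}\right) - (J - 1 + p_x)}
\]

and

\[
\rho_{uv} = \begin{cases}
\hat\alpha_{|u - v|} & \text{if } |u - v| \le m \\
0 & \text{otherwise}
\end{cases}
\]

Unstructured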


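The unstructured correlation structure is defined as:

\[ R_{uv} = \begin{cases} 1 & \text{if } u = v \\ \alpha_{uv} & \text{otherwise} \end{cases} \]

$\frac{1}{2} n_i (n_i - 1)$ parameters are estimated as follows:

\[
\hat\alpha_{uv} =
\frac{\dfrac{1}{\left(\sum_{i=1}^{K} f_i I_{i,uv}\right) - p_x} \sum_{i=1}^{K} f_i r_{iu} r_{iv}}
{\dfrac{1}{N' - p_x} \sum_{i=1}^{K} f_i \sum_{t=1}^{n_i} r_{it}^2}
\]

where $I_{i,uv} = 1$ if the ith subject has non-missing measurements at times u and v; 0 otherwise.

For the ordinal multinomial model:

\[
\hat\alpha_{uv} =
\frac{\sum_{i=1}^{K} f_i r_{iu} r_{iv}^T}
{\left(\sum_{i=1}^{K} f_i I_{i,uv}\right) - (J - 1 + p_x)}
\]

and $\rho_{uv} = \hat\alpha_{uv}$, $u, v = 1, \ldots, n_i$ and $u \ne v$.
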
Fixed 


The fixed correlation structure is defined as:

\[ R_{uv} = \begin{cases} 1 & \text{if } u = v \\ \rho_{uv} & \text{otherwise} \end{cases} \]

where $\rho_{uv}$ is user-specified.
Fixed correlation structures are not allowed for ordinal multinomial models.
No parameters need to be estimated for this structure.
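
The following sketch shows how moment estimators of this kind can be computed for the exchangeable and AR(1) structures in the non-ordinal case. It rests on several assumptions: frequency weights are 1, no measurements are missing, the residuals are already Pearson residuals grouped by subject, and each denominator is reduced by $p_x$. The function names are hypothetical.

```python
def scale_estimate(resid, p_x):
    """Sum of squared Pearson residuals over (number of measurements - p_x)."""
    n_obs = sum(len(r) for r in resid)
    return sum(e * e for r in resid for e in r) / (n_obs - p_x)

def exchangeable_alpha(resid, p_x):
    """Average cross-product over all within-subject pairs, divided by the scale."""
    pairs = sum(len(r) * (len(r) - 1) / 2 for r in resid)
    cross = sum(r[t] * r[s] for r in resid for t in range(len(r)) for s in range(t))
    return (cross / (pairs - p_x)) / scale_estimate(resid, p_x)

def ar1_alpha(resid, p_x):
    """Average cross-product over adjacent pairs, divided by the scale."""
    pairs = sum(len(r) - 1 for r in resid)
    cross = sum(r[t] * r[t + 1] for r in resid for t in range(len(r) - 1))
    return (cross / (pairs - p_x)) / scale_estimate(resid, p_x)

# Three subjects with three Pearson residuals each (made-up values).
residuals = [[1.0, 1.0, 1.0], [-1.0, -1.0, -1.0], [1.0, -1.0, 1.0]]
alpha_exch = exchangeable_alpha(residuals, p_x=1)
alpha_ar1 = ar1_alpha(residuals, p_x=1)
```

With missing measurements or non-unit frequency weights, the pair counts and sums would need the corresponding adjustments.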


Notes 


■ When the scale parameter is updated by the current Pearson residuals, the denominator for the
$\alpha$ parameter vector is an estimator of the scale parameter.


■ The denominators in the above equations and in the estimator of the scale parameter are all
adjusted by the number of non-redundant parameters (not subtracted by $p_x$). The user can
specify that these adjustments not be used so that the numerator and denominator parts are
invariant to subject-level replication changes of the data. If the denominators are non-positive,
that is, if the summation part is smaller than or equal to $p_x$, then only the summation part is
used.

Estimation 


Having selected a particular model, it is required to estimate the parameters and to assess the 
precision of the estimates. 


Parameter Estimation 


The algorithm for estimating model parameters using GEEs is outlined below. Note that the scale
parameter $\phi$ and the ancillary parameter k are not part of parameter estimation; see below for
how they are handled.


Some definitions are needed for an iterative process: 


Table 46-25
Notation

Notation                    Description
$M$                         The maximum number of iterations. It must be a non-negative integer. If the value is 0, then initial parameter values become final estimates.
$N_u$                       The number of iterations between updates of the working correlation matrix. It must be a positive integer.
CORRTYPE                    The specified working correlation structure.
$\epsilon_p, \epsilon_H$    Tolerance levels for different types of convergence criteria.
Abs                         A 0/1 binary variable; Abs = 1 if absolute change is used for convergence criteria and Abs = 0 if relative change is used.

1. Input initial values $\beta^{(0)}$ and/or $\phi^{(0)}$ or, if no initial values are given, compute initial estimates
with an independent generalized linear model.

2. Compute the working correlation $R(\alpha)$ based on $\beta^{(0)}$, Pearson residuals and a specified working
correlation structure (CORRTYPE). Check if $R(\alpha)$ is positive definite for exchangeable,
m-dependent and unstructured structures. If it is not, revise it to be equal to $\frac{1}{1+\varsigma}(R(\alpha) + \varsigma I)$, where
$I$ is an identity matrix and $\varsigma$ is a ridge value such that the adjusted matrix is positive definite. If a
fixed correlation matrix is specified by the user and it is not positive definite, issue a warning
and stop. Then compute the initial estimate of the covariance matrix of $y_i$, $V_i^{(0)}$, the generalized
estimating equation $s^{(0)}$, and the generalized Hessian matrix $H^{(0)}$ (see formulae below) based on
$\beta^{(0)}$ and $V_i^{(0)}$.

3. Initialize $v = 0$.

4. Set $v = v + 1$.

5. Compute estimates of the vth iteration:

\[ \beta^{(v)} = \beta^{(v-1)} - \left(H^{(v-1)}\right)^{-} s^{(v-1)}. \]

6. If $v / N_u$ is a positive integer, update the working correlation, checking for positive definiteness
as above.

7. Compute an estimate of the covariance matrix of $y_i$ and its generalized inverse:

\[ V_i^{(v)} = \phi A_i^{1/2} R(\alpha) A_i^{1/2} \quad \text{and} \quad \left(V_i^{(v)}\right)^{-} = \phi^{-1} A_i^{-1/2} R(\alpha)^{-} A_i^{-1/2}. \]

For the ordinal multinomial model, replace $R(\alpha)$ with $R_i(\alpha)$ in the above equations.

8. Revise $s^{(v)}$ and $H^{(v)}$ based on $\beta^{(v)}$ and $V_i^{(v)}$:

\[ s^{(v)} = \sum_{i=1}^{K} f_i \left(\frac{\partial \mu_i}{\partial \beta}\right)^T \left(V_i^{(v)}\right)^{-} (y_i - \mu_i), \]

\[ H^{(v)} = -\sum_{i=1}^{K} f_i \left(\frac{\partial \mu_i}{\partial \beta}\right)^T \left(V_i^{(v)}\right)^{-} \left(\frac{\partial \mu_i}{\partial \beta}\right). \]

For the ordinal multinomial model,

\[ H^{(v)} = -\sum_{i=1}^{K} f_i \left(\frac{\partial \pi_i}{\partial B}\right)^T \left(V_i^{(v)}\right)^{-} \left(\frac{\partial \pi_i}{\partial B}\right). \]

9. Check the convergence criteria. If they are met or the maximum number of iterations is reached,
stop. The final vector of estimates is denoted by $\hat\beta$ (and $\hat\psi$ for the ordinal multinomial). Otherwise,
go back to step (4).
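
The iteration can be illustrated with a deliberately small sketch: a Poisson model with log link, an exchangeable working correlation updated at every iteration ($N_u = 1$), unit frequency and scale weights, $\phi = 1$, and no missing measurements. All function names are hypothetical. The sketch starts from $\beta = 0$ rather than from an independence fit and omits the positive-definiteness check, so it is not a substitute for the full procedure.

```python
import math

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def fitted_means(beta, X):
    return [math.exp(sum(b * x for b, x in zip(beta, row))) for row in X]

def exchangeable_alpha(clusters, beta, p):
    """Moment estimate of alpha from Pearson residuals at the current beta."""
    cross = pairs = sumsq = count = 0.0
    for X, y in clusters:
        r = [(yt - m) / math.sqrt(m) for yt, m in zip(y, fitted_means(beta, X))]
        n = len(r)
        cross += sum(r[t] * r[u] for t in range(n) for u in range(t))
        pairs += n * (n - 1) / 2
        sumsq += sum(e * e for e in r)
        count += n
    return (cross / (pairs - p)) / (sumsq / (count - p))

def gee_poisson_exchangeable(clusters, p, max_iter=100, eps=1e-10):
    """clusters: list of (X_i, y_i), where X_i is a list of rows of length p."""
    beta, alpha = [0.0] * p, 0.0
    for _ in range(max_iter):
        s = [0.0] * p
        neg_h = [[0.0] * p for _ in range(p)]             # accumulates -H
        for X, y in clusters:
            n, mu = len(y), fitted_means(beta, X)
            # V_i = A^(1/2) R(alpha) A^(1/2) with A = diag(mu) for Poisson.
            V = [[math.sqrt(mu[u] * mu[v]) * (1.0 if u == v else alpha)
                  for v in range(n)] for u in range(n)]
            D = [[mu[t] * X[t][j] for j in range(p)] for t in range(n)]  # d mu / d beta
            vinv_r = solve(V, [y[t] - mu[t] for t in range(n)])
            for j in range(p):
                s[j] += sum(D[t][j] * vinv_r[t] for t in range(n))
                vinv_dj = solve(V, [D[t][j] for t in range(n)])
                for k in range(p):
                    neg_h[k][j] += sum(D[t][k] * vinv_dj[t] for t in range(n))
        step = solve(neg_h, s)                            # equals -(H^-) s
        beta = [b + d for b, d in zip(beta, step)]
        alpha = exchangeable_alpha(clusters, beta, p)     # update working correlation
        if max(abs(d) for d in step) < eps:               # absolute parameter change
            break
    return beta, alpha

ones = [[1.0], [1.0], [1.0]]
clusters = [(ones, [1, 1, 2]), (ones, [3, 3, 2]), (ones, [2, 1, 3]), (ones, [2, 3, 1])]
beta_hat, alpha_hat = gee_poisson_exchangeable(clusters, p=1)
```

For an intercept-only model with equally sized clusters, the estimating equation reduces to matching the overall mean, so the fitted intercept should equal the log of the sample mean whatever the value of $\alpha$; this makes a convenient check.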


Scale Parameter Handling 


If no initial values are given, the initial values are computed with an independent GZLM. The
ways to deal with the scale parameter in the GZLM step (1) and the GEE step (7) are as follows:


■ For normal, inverse Gaussian, gamma, and Tweedie response, if the scale parameter is
estimated by the ML method in step (1), then in step (7) $\phi$ would be updated as

\[ \hat\phi = \frac{1}{N - p_x} \sum_{i=1}^{K} \sum_{t=1}^{n_i} f_i r_{it}^2 \]

where $r_{it}$ is the Pearson residual, and $n_i$ is the number of non-missing measurements on
the ith subject.


■ If the scale parameter is set to a fixed value in step (1), then $\phi$ would be held fixed at that
value in step (7) as well.


Convergence Criteria 


We consider parameter convergence and Hessian convergence. For parameter convergence,
we consider both absolute and relative change, but for Hessian convergence, we only consider
absolute change because the log-likelihood values used as the denominator for relative change
are not valid for GEE. Let $\epsilon_p$ and $\epsilon_H$ be given tolerance levels for each type; then the criteria
can be written as follows:

Parameter convergence:

\[ \max_j \left( \frac{\left|\beta_j^{(v)} - \beta_j^{(v-1)}\right|}{\left|\beta_j^{(v-1)}\right| + 10^{-6}} \right) < \epsilon_p \quad \text{if relative change} \]

\[ \max_j \left( \left|\beta_j^{(v)} - \beta_j^{(v-1)}\right| \right) < \epsilon_p \quad \text{if absolute change} \]

Hessian convergence:

\[ \left(s^{(v)}\right)^T \left(H^{(v)}\right)^{-} \left(s^{(v)}\right) < \epsilon_H \quad \text{if absolute change} \]

If the Hessian convergence criterion is not user-specified, it is checked based on absolute change
with $\epsilon_H$ = 1E-4 after the log-likelihood or parameter convergence criterion has been satisfied. If
Hessian convergence is not met, a warning is displayed.
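
The parameter convergence check can be sketched as follows. The function name is hypothetical, and the small constant added to the denominator of the relative change is an assumption made here to avoid division by zero.

```python
def parameters_converged(beta_new, beta_old, eps_p=1e-6, absolute=True):
    """Largest absolute (or relative) change in any parameter is below eps_p."""
    changes = []
    for new, old in zip(beta_new, beta_old):
        delta = abs(new - old)
        if not absolute:
            delta /= abs(old) + 1e-6   # assumed guard against a zero denominator
        changes.append(delta)
    return max(changes) < eps_p
```

The `absolute` flag plays the role of the Abs indicator: the same pair of iterates can pass the relative test while failing the absolute one when the parameters are large.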


Parameter Estimate Covariance Matrix, Correlation Matrix and Standard Errors 
Parameter Estimate Covariance 


Two parameter estimate covariance matrices can be calculated. One is the model-based estimator 
and the other one is the robust estimator. As in the generalized linear model, the consistency of the 
model-based parameter estimate covariance depends on the correct specification of the mean and 
variance of the response (including correct choice of the working correlation matrix). However, 
the robust parameter estimate covariance is still consistent even when the specification of the 
working correlation matrix is incorrect as we often expect. 


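The model-based parameter estimate covariance is

\[ \Sigma_m = -H_1^{-} \]

where $H_1^{-}$ is the generalized inverse of

\[ H_1 = -\sum_{i=1}^{K} f_i \left(\frac{\partial \mu_i}{\partial \beta}\right)^T V_i^{-} \left(\frac{\partial \mu_i}{\partial \beta}\right). \]

For the ordinal multinomial model,

\[ H_1 = -\sum_{i=1}^{K} f_i \left(\frac{\partial \pi_i}{\partial B}\right)^T V_i^{-} \left(\frac{\partial \pi_i}{\partial B}\right). \]

The robust parameter estimate covariance is

\[ \Sigma_r = H_1^{-} H_2 H_1^{-} \]

where

\[ H_2 = \sum_{i=1}^{K} f_i \left(\frac{\partial \mu_i}{\partial \beta}\right)^T V_i^{-} \operatorname{cov}(y_i) V_i^{-} \left(\frac{\partial \mu_i}{\partial \beta}\right) \]

and $\operatorname{cov}(y_i)$ can be estimated by $(y_i - \mu_i)(y_i - \mu_i)^T$.

For the ordinal multinomial model,

\[ H_2 = \sum_{i=1}^{K} f_i \left(\frac{\partial \pi_i}{\partial B}\right)^T V_i^{-} \operatorname{cov}(z_i) V_i^{-} \left(\frac{\partial \pi_i}{\partial B}\right) \]

and $\operatorname{cov}(z_i)$ can be estimated by $(z_i - \pi_i)(z_i - \pi_i)^T$.

Note that the model-based parameter estimate covariance will be affected by how the scale parameter
is handled, but the robust parameter estimate covariance will not be affected by the estimate of
the scale parameter because $\phi$ is cancelled in different terms.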


Parameter Estimate Correlation 


Parameter estimate correlation is calculated as described in GZLM. For more information, see the 
topic “Parameter Estimate Covariance Matrix, Correlation Matrix and Standard Errors”. 


Parameter Estimate Standard Error 


Parameter estimate standard errors are calculated as described in GZLM. There is no standard 
error for the scale parameter in GEE. For more information, see the topic “Parameter Estimate 
Covariance Matrix, Correlation Matrix and Standard Errors”. 


Wald Confidence Intervals 


Wald confidence intervals are calculated as described in GZLM. For more information, see the 
topic “Confidence Intervals”. 


Chi-Square Statistics 


The chi-square statistics and corresponding p-values are calculated as described in GZLM. For 
more information, see the topic “Chi-Square Statistics”. 


Model Testing 


Since GEE is not a likelihood-based method of estimation, the inferences based on likelihoods 
are not possible for GEEs. Most notably, the Lagrange multiplier test, goodness-of-fit tests, and 
omnibus tests are invalid and will not be offered. 


Default tests of model effects are as in GZLM. For more information, see the topic “Default 
Tests of Model Effects”. 


Estimated marginal means are as in GZLM. For more information, see the topic “Estimated 
Marginal Means”. 


Goodness of Fit 


None of the goodness-of-fit statistics which are available for GZLM are valid for GEE. However, 
Pan (2001b) introduced two useful extensions of AIC as goodness-of-fit statistics for model 
selection based on the quasi-likelihood function. One is for working correlation structure selection 
and the other is for variable selection. Both of them are based on the quasi-likelihood function 
under the independence model and the known scale parameter assumptions. 


For the ordinal multinomial model, these goodness of fit statistics are not available because the
log quasi-likelihood function cannot be derived.

Based on the model specification $E(y) = \mu$ and $\operatorname{Var}(y) = V(\mu)\phi/\omega$, the (log) quasi-likelihood
function for each case is

\[ Q_k(\mu; \phi, \omega, y) = \int^{\mu} \frac{y - t}{(\phi/\omega) V(t)}\, dt = F(\mu) \]

or

\[ Q(\mu; \phi, \omega, y) = \int_{y}^{\mu} \frac{y - t}{(\phi/\omega) V(t)}\, dt = F(\mu) - F(y) \]

which we shall call the kernel quasi-likelihood and full quasi-likelihood, respectively.

Since the components of Y are independent by assumption, the kernel and full quasi-likelihood
for the complete data is the sum of the individual contributions

\[ Q_k(\mu; \phi, \omega, y) = \sum_{i=1}^{n} Q_{k,i}(\mu_i; \phi, \omega_i, y_i) \]

and

\[ Q(\mu; \phi, \omega, y) = \sum_{i=1}^{n} Q_i(\mu_i; \phi, \omega_i, y_i). \]

Since $\mu$ would depend on $\beta$, we change notation from $Q_k(\mu; \phi, \omega, y)$ to $Q_k(\beta; I)$ and
$Q(\mu; \phi, \omega, y)$ to $Q(\beta; I)$, where I implies the independence assumption. The quasi-likelihood
functions for the probability distributions are listed in the following table:
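Table 46-26
Quasi-likelihood functions for probability distributions

Distribution: $Q_k(\beta; I)$ and $Q(\beta; I)$

Normal
\[ Q_k(\beta; I) = Q(\beta; I) = \sum_{i=1}^{n} \frac{f_i \omega_i}{\phi} \left\{ -\frac{1}{2}(y_i - \mu_i)^2 \right\} \]

Inverse Gaussian
\[ Q_k(\beta; I) = \sum_{i=1}^{n} \frac{f_i \omega_i}{\phi} \left\{ -\frac{y_i}{2\mu_i^2} + \frac{1}{\mu_i} \right\} \]
\[ Q(\beta; I) = -\sum_{i=1}^{n} \frac{f_i \omega_i (y_i - \mu_i)^2}{2 \phi\, y_i \mu_i^2} \]

Gamma
\[ Q_k(\beta; I) = \sum_{i=1}^{n} \frac{f_i \omega_i}{\phi} \left\{ -\frac{y_i}{\mu_i} - \ln(\mu_i) \right\} \]
\[ Q(\beta; I) = \sum_{i=1}^{n} \frac{f_i \omega_i}{\phi} \left\{ -\frac{y_i}{\mu_i} + \ln\left(\frac{y_i}{\mu_i}\right) + 1 \right\} \]

Negative binomial
\[ Q_k(\beta; I) = \sum_{i=1}^{n} \frac{f_i \omega_i}{\phi} \left\{ y_i \ln(k\mu_i) - (y_i + 1/k) \ln(1 + k\mu_i) \right\} \]
\[ Q(\beta; I) = \sum_{i=1}^{n} \frac{f_i \omega_i}{\phi} \left\{ y_i \ln\left(\frac{\mu_i}{y_i}\right) + (y_i + 1/k) \ln\left(\frac{1 + k y_i}{1 + k\mu_i}\right) \right\} \]

Poisson
\[ Q_k(\beta; I) = \sum_{i=1}^{n} \frac{f_i \omega_i}{\phi} \left\{ y_i \ln(\mu_i) - \mu_i \right\} \]
\[ Q(\beta; I) = \sum_{i=1}^{n} \frac{f_i \omega_i}{\phi} \left\{ y_i \ln\left(\frac{\mu_i}{y_i}\right) + (y_i - \mu_i) \right\} \]

Binomial(m)
\[ Q_k(\beta; I) = \sum_{i=1}^{n} \frac{f_i \omega_i m_i}{\phi} \left\{ y_i \ln(\mu_i) + (1 - y_i) \ln(1 - \mu_i) \right\} \]

Tweedie
\[ Q_k(\beta; I) = \sum_{i=1}^{n} \frac{f_i \omega_i}{\phi} \left\{ \frac{(2 - q)\, y_i \mu_i^{1-q} - (1 - q)\, \mu_i^{2-q}}{(1 - q)(2 - q)} \right\} \]
\[ Q(\beta; I) = \sum_{i=1}^{n} \frac{f_i \omega_i}{\phi} \left\{ \frac{-y_i^{2-q} + (2 - q)\, y_i \mu_i^{1-q} - (1 - q)\, \mu_i^{2-q}}{(1 - q)(2 - q)} \right\} \]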


Then the quasi-likelihood under the independence model criterion (QIC) for choosing the best
correlation structure is defined as

\[ QIC(R) = -2Q(\hat\beta_R; I) + 2\operatorname{trace}\left(-H_{1,I}\, \Sigma_{r,R}\right) \]

There are three terms in the above formula:

$Q(\hat\beta_R; I)$ is the value of the quasi-likelihood computed using the parameter estimates from the
model with hypothesized correlation structure R; that is, the estimates $\hat\beta_R$. In evaluating the
quasi-likelihood, we use $\hat\mu$ in place of $\mu$. The scale parameter is unknown in practice, so we have
to assign a value. If it is set to a fixed value by the user, then that value is used; otherwise 1 is
used. Note that $Q(\hat\beta_R; I)$ could be replaced by $Q_k(\hat\beta_R; I)$.

$H_{1,I}$ is the generalized Hessian matrix obtained by fitting an independent working correlation
model.

$\Sigma_{r,R}$ is the robust estimator for parameter estimate covariance from the model with hypothesized
correlation structure R.

Under the assumption that all modeling specifications in GEE are correct,
$\operatorname{trace}(-H_{1,I}\, \Sigma_{r,R}) \approx p_x$, and the above QIC reduces to

\[ QIC_u(R) = -2Q(\hat\beta_R; I) + 2p_x \]

So $QIC_u(R)$ can be useful for choosing the best subset of covariates for a particular model. For
the use of QIC and $QIC_u(R)$, the model with the smallest value is preferred. Note again that
$Q(\hat\beta_R; I)$ could be replaced by $Q_k(\hat\beta_R; I)$.
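
As a worked illustration, the two criteria can be computed for a Poisson response once the fitted means, the negated independence Hessian, and the robust covariance are available. The sketch assumes unit frequency and scale weights and uses the full quasi-likelihood; the function names and the small input matrices are made up for the example.

```python
import math

def poisson_quasi_likelihood(y, mu, phi=1.0):
    """Full Poisson quasi-likelihood: sum of y*ln(mu/y) + (y - mu), over phi."""
    total = 0.0
    for yi, mi in zip(y, mu):
        log_term = yi * math.log(mi / yi) if yi > 0 else 0.0   # 0 * ln(.) taken as 0
        total += (log_term + (yi - mi)) / phi
    return total

def qic(q, neg_hessian_indep, robust_cov):
    """-2Q + 2 trace(-H_I * Sigma_r); the first matrix is already negated."""
    p = len(robust_cov)
    trace = sum(neg_hessian_indep[j][k] * robust_cov[k][j]
                for j in range(p) for k in range(p))
    return -2.0 * q + 2.0 * trace

def qic_u(q, p_x):
    """-2Q + 2 p_x."""
    return -2.0 * q + 2.0 * p_x

q_value = poisson_quasi_likelihood([1, 0, 3], [1.0, 0.5, 3.0])
qic_value = qic(q_value, [[2.0, 0.0], [0.0, 4.0]], [[0.5, 0.0], [0.0, 0.25]])
qic_u_value = qic_u(q_value, p_x=2)
```

In this toy input the trace term equals $p_x$, so the two criteria coincide, which is the situation the approximation describes.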


Default Tests of Model Effects 
For type I and III analyses, Wald statistics are still valid. 
Generalized Score Statistics 


For type I and III analyses, the method of constructing a generalized score statistic is the same, 
the only difference is the method of constructing L matrices. A generalized score statistic can be 
computed as follows (when the process is applied to the ordinal multinomial model, all formulae 
should be modified accordingly): 


Calculate 3_; under the constraint L;3 — 0 for each effect i, 


6p 2 B03 1b 0. 


OD 
where L; is a type III test matrix for the ith effect. 
The iterative process to calculate the above optimal 3_, is a combination of sequential quadratic 


programming and GEE parameter estimation. This kind of iterative process will be used here and 
for custom tests, so we will describe the iterative process in a more general form: 


pass) Sis tek 


The iterative process is outlined briefly as follows: 


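1. Find initial values $\beta^{(0)}$ with $L\beta^{(0)} = K$ as described in Section 2.3.4.2-(a).

2. Compute the working correlation $R(\alpha)$ based on the last iteration's estimate $\beta^{(\nu-1)}$, Pearson residuals, and a specified working correlation structure if $(\nu - 1 + N_u)/N_u$ is an integer; otherwise the working correlation is not updated.

3. Compute an estimate of the covariance matrix of $y_i$ and its generalized inverse

$$V_i^{(\nu-1)} = \phi A_i^{1/2} R(\alpha) A_i^{1/2} \quad\text{and}\quad \left(V_i^{(\nu-1)}\right)^{-} = \phi^{-1} A_i^{-1/2} R(\alpha)^{-} A_i^{-1/2}.$$

Also compute $s^{(\nu-1)}$ and $H^{(\nu-1)}$ based on $\beta^{(\nu-1)}$ and $V_i^{(\nu-1)}$, where

$$s^{(\nu-1)} = \sum_i \left(\frac{\partial\mu_i}{\partial\beta}\right)^T \left(V_i^{(\nu-1)}\right)^{-} (y_i - \mu_i)$$

and $H^{(\nu-1)}$ is the corresponding generalized Hessian.

4. Find a solution $\delta^{(\nu-1)}$ and $\lambda^{(\nu)}$ of the following KKT equations:

$$\begin{pmatrix} H^{(\nu-1)} & L^T \\ L & 0 \end{pmatrix} \begin{pmatrix} \delta^{(\nu-1)} \\ \lambda^{(\nu)} \end{pmatrix} = \begin{pmatrix} -s^{(\nu-1)} \\ 0 \end{pmatrix}$$

5. Compute the estimates of the $\nu$th iteration:

$$\beta^{(\nu)} = \beta^{(\nu-1)} + \delta^{(\nu-1)}$$

6. Check if the convergence criteria are met. If they are, or the maximum number of iterations is reached, stop. The final vector of estimates is denoted by $\tilde\beta$. Otherwise, go back to step 2.

Note: the convergence criteria here are similar to those for parameter estimation, except that the Hessian convergence criterion is modified as follows:

$$\left(s^{(\nu)} + L^T\lambda^{(\nu+1)}\right)^T \left(H^{(\nu)}\right)^{-} \left(s^{(\nu)} + L^T\lambda^{(\nu+1)}\right) < \epsilon_H$$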
Compute the generalized estimating equation $s(\tilde\beta_i)$ based on the optimal $\tilde\beta_i$.

Calculate the generalized score statistic for each effect $i$,

$$GS_i = s(\tilde\beta_i)^T\, \Sigma_m L_i^T \left(L_i \Sigma_r L_i^T\right)^{-1} L_i \Sigma_m\, s(\tilde\beta_i),$$

where $\Sigma_m$ is the model-based parameter estimate covariance and $\Sigma_r$ is the robust parameter estimate covariance, each evaluated at $\tilde\beta_i$. Then the asymptotic distribution of $GS_i$ is $\chi^2_{r_i}$, where $r_i$ is the rank of $L_i$, and the p-values can be calculated accordingly.
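
The constrained update in steps 4 and 5 of the iterative process can be sketched as follows. This is only an illustration under stated assumptions: the score vector `s`, the (negative definite) Hessian `H`, and the constraint matrix `L` are toy values, and the helper names `solve` and `kkt_step` are hypothetical. The working correlation update and the GEE-specific quantities are not modeled.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [u - f * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def kkt_step(H, L, s):
    """Solve [[H, L^T], [L, 0]] [delta; lambda] = [-s; 0]."""
    p, c = len(H), len(L)
    kkt = [list(H[i]) + [L[k][i] for k in range(c)] for i in range(p)]
    kkt += [list(L[k]) + [0.0] * c for k in range(c)]
    rhs = [-v for v in s] + [0.0] * c
    sol = solve(kkt, rhs)
    return sol[:p], sol[p:]


# Toy problem: maximize 2*b1 + 4*b2 - b1^2 - b2^2 subject to b1 - b2 = 0.
H = [[-2.0, 0.0], [0.0, -2.0]]   # Hessian of the toy objective
L = [[1.0, -1.0]]                # constraint L beta = K with K = 0
beta = [0.0, 0.0]                # feasible starting value
s = [2.0, 4.0]                   # gradient of the toy objective at beta

delta, lam = kkt_step(H, L, s)
beta = [b + d for b, d in zip(beta, delta)]
print(beta, lam)
```

Because the toy objective is quadratic, a single step reaches the constrained optimum; in the actual procedure the step is repeated until the convergence criteria are met.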
Wald Statistics 
For more information, see the topic “Default Tests of Model Effects”. Note that $\Sigma_r$ (or $\Sigma_m$) should be used as the estimated covariance matrix.
Scoring 

Predicted values of the linear predictor, estimated standard errors of the predicted values of the linear predictor, predicted values of the means, and confidence intervals for the means are calculated.

For more information, see the topic “Predicted Values”. 
Only two types of residuals are offered as model diagnostics in GEE: raw residuals and Pearson 
residuals. For more information, see the topic “Diagnostics”. 
GENLOG Multinomial Loglinear and Logit Models Algorithms


This chapter describes the algorithms used to calculate maximum-likelihood estimates for the 
multinomial loglinear model and the multinomial logit model. This algorithm is applicable only to 
aggregated data. 


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


$A$          Generic categorical independent (explanatory) variable. Its categories are indexed by an array of integers.

$B$          Generic categorical dependent (response) variable. Its categories are indexed by an array of integers.

$r$          Number of categories of B. $r \ge 1$

$c$          Number of categories of A. $c \ge 1$

$p$          Number of nonredundant (nonaliased) parameters.

$i$          Generic index for the categories of B. $i = 1, \ldots, r$

$j$          Generic index for the categories of A. $j = 1, \ldots, c$

$k$          Generic index for the parameters. $k = 1, \ldots, p$

$n_{ij}$     Observed count in the ith response of B and the jth setting of A. $n_{ij} \ge 0$

$N_j$        Marginal total count at the jth setting of A. $N_j = \sum_{i=1}^{r} n_{ij}$

$N$          Total observed count. $N = \sum_{j=1}^{c}\sum_{i=1}^{r} n_{ij}$

$m_{ij}$     Expected count. $m_{ij} \ge 0$

$\pi_{ij}$   Probability of having an observation in the ith response of B and the jth setting of A. $0 \le \pi_{ij} \le 1$ and $\sum_{i=1}^{r} \pi_{ij} = 1$

$z_{ij}$     Cell structure value.

$\alpha_j$   jth normalizing constant.

$\beta_k$    kth nonredundant parameter.

$\beta$      A vector of $(\beta_1, \ldots, \beta_p)'$.

$x_{ijk}$    An element in the ith row and the kth column of the design matrix for the jth setting.


The same notation is used for both loglinear and logit models so that the methods are presented in 
a unified way. Conceptually, one can consider a loglinear model as a special case of a logit model 
where the explanatory variable has only one level (that is, c=1). 


Model 


There are two components in a loglinear model: the random component and the systematic 
component. 
Random Component

The random component describes the joint distribution of the counts.

■ The counts $\{n_{1j}, \ldots, n_{rj}\}$ at the jth setting of A have the multinomial $(N_j, \pi_{1j}, \ldots, \pi_{rj})$ distribution.

■ The counts $n_{ij}$ and $n_{i'j'}$ are independent if $j \ne j'$.

■ The joint probability distribution of $\{n_{ij}\}$ is the product of these c independent multinomial distributions. The probability density function is

$$\prod_{j=1}^{c} \left[ \frac{N_j!}{\prod_{i=1}^{r} n_{ij}!} \prod_{i=1}^{r} \pi_{ij}^{n_{ij}} \right]$$

■ The expected count is $E(n_{ij}) = m_{ij} = N_j\pi_{ij}$.

■ The covariance is

$$\operatorname{cov}(n_{ij}, n_{i'j'}) = \begin{cases} N_j\pi_{ij}(\delta_{ii'} - \pi_{i'j}) & \text{if } j = j' \\ 0 & \text{if } j \ne j' \end{cases}$$

where $\delta_{ab} = 1$ if $a = b$ and $\delta_{ab} = 0$ if $a \ne b$.


Systematic Component 


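The systematic component describes the linkage function between the expected counts and the parameters. The expected counts are themselves functions of other parameters. Explicitly, for $i = 1, \ldots, r$ and $j = 1, \ldots, c$,

$$m_{ij} = \begin{cases} z_{ij}\, e^{\alpha_j + \nu_{ij}} & \text{if } z_{ij} > 0 \\ 0 & \text{if } z_{ij} \le 0 \end{cases}$$

where

$$\nu_{ij} = \sum_{k=1}^{p} x_{ijk}\beta_k$$


Normalizing Constants


The $\alpha_j$ are not considered as parameters, but as normalizing constants.

$$\alpha_j = \log\left( \frac{N_j}{\sum_{i=1}^{r} z_{ij}\, e^{\nu_{ij}}} \right), \quad j = 1, \ldots, c$$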


Cell Structure Values 


The cell structure values play two roles in loglinear procedures, depending on their signs. If $z_{ij} > 0$, it is a usual weight for the corresponding cell and $\log(z_{ij})$ is sometimes called the offset. If $z_{ij} \le 0$, a structural zero is imposed on the cell $(B = i, A = j)$. Contingency tables containing at least one structural zero are called incomplete tables. If $n_{ij} = 0$ but $z_{ij} > 0$, the cell $(B = i, A = j)$ contains a sampling zero. Although a structural zero is still considered part of the contingency table, it is not used in fitting the model. Cellwise statistics are not computed for structural zeros.


Maximum-Likelihood Estimation 


The multinomial log-likelihood is

$$L(\beta) = L(\beta_1, \ldots, \beta_p) = \text{constant} + \sum_{j=1}^{c}\sum_{i=1}^{r} n_{ij}\log(m_{ij})$$


Likelihood Equations 


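It can be shown that:

$$\frac{\partial L}{\partial\beta_k} = \sum_{j=1}^{c}\sum_{i=1}^{r} (n_{ij} - m_{ij})\, x_{ijk} \quad \text{for } k = 1, \ldots, p$$

Let $g(\beta) = (g_1(\beta), \ldots, g_p(\beta))'$ be the $p \times 1$ gradient vector with

$$g_k(\beta) = \frac{\partial L}{\partial\beta_k}$$

The maximum-likelihood estimates $\hat\beta = (\hat\beta_1, \ldots, \hat\beta_p)'$ are regarded as a solution to the vector of likelihood equations:

$$g(\hat\beta) = 0$$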


Hessian Matrix 


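The likelihood equations are nonlinear functions of $\beta$. Solving them for $\hat\beta$ requires an iterative method. The Newton-Raphson method is used. It can be shown that

$$\frac{\partial^2 L}{\partial\beta_k\,\partial\beta_l} = -\sum_{j=1}^{c}\sum_{i=1}^{r} m_{ij}\,(x_{ijk} - \theta_{jk})(x_{ijl} - \theta_{jl})$$

where

$$\theta_{jk} = \sum_{i=1}^{r} \pi_{ij}\, x_{ijk}, \quad j = 1, \ldots, c \text{ and } k = 1, \ldots, p$$

Let $H(\beta)$ be the $p \times p$ information matrix, where $-H(\beta)$ is the Hessian matrix of the log-likelihood. The elements of $H(\beta)$ are

$$h_{kl}(\beta) = -\frac{\partial^2 L}{\partial\beta_k\,\partial\beta_l}, \quad k = 1, \ldots, p \text{ and } l = 1, \ldots, p$$

Note: $H(\hat\beta)$ is a symmetric positive-definite matrix. The asymptotic covariance matrix of $\hat\beta$ is estimated by $H^{-1}(\hat\beta)$.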
Newton-Raphson Method

Let $\beta^{(s)}$ denote the sth approximation for the solution. By the Newton-Raphson method,

$$\beta^{(s+1)} = \beta^{(s)} + H^{-1}(\beta^{(s)})\, g(\beta^{(s)})$$

Define $q(\beta) = H(\beta)\beta + g(\beta)$. The kth element of $q(\beta)$ is

$$q_k(\beta) = \sum_{j=1}^{c}\sum_{i=1}^{r} m_{ij}\,\eta_{ij}\,(x_{ijk} - \theta_{jk})$$

where

$$\eta_{ij} = \begin{cases} \log(m_{ij}/z_{ij}) + (n_{ij} - m_{ij})/m_{ij} & \text{if } z_{ij} > 0 \text{ and } m_{ij} > 0 \\ 0 & \text{otherwise} \end{cases}$$

Then

$$H(\beta^{(s)})\,\beta^{(s+1)} = q(\beta^{(s)})$$

Thus, given $\beta^{(s)}$, the (s+1)th approximation $\beta^{(s+1)}$ is found by solving this system of equations.


Initial Values 


$\beta^{(0)}$, which corresponds to a saturated model, is used as the initial value for $\beta$. Then the initial estimates for the expected cell counts are

$$m_{ij}^{(0)} = \begin{cases} n_{ij} + \Delta & \text{if } z_{ij} > 0 \\ 0 & \text{if } z_{ij} \le 0 \end{cases}$$

where $\Delta \ge 0$ is a constant.

Note: For saturated models, $\Delta$ is added to $n_{ij}$ if $z_{ij} > 0$. This is done to avoid numerical problems in case some observed counts are 0. We advise users to set $\Delta$ to 0 whenever all observed counts (other than structural zeros) are positive.

The initial values for other quantities are

$$\eta_{ij}^{(0)} = \begin{cases} \log\left(m_{ij}^{(0)}/z_{ij}\right) + \left(n_{ij} - m_{ij}^{(0)}\right)/m_{ij}^{(0)} & \text{if } z_{ij} > 0 \text{ and } m_{ij}^{(0)} > 0 \\ 0 & \text{otherwise} \end{cases}$$
Stopping Criteria

The following conditions are checked for convergence:

1. $\max_{i,j}\left( \left| m_{ij}^{(s+1)} - m_{ij}^{(s)} \right| / m_{ij}^{(s)} \right) < \epsilon$ provided that $m_{ij}^{(s)} > 0$

2. $\max_{i,j}\left( \left| m_{ij}^{(s+1)} - m_{ij}^{(s)} \right| \right) < \epsilon$


: i (2a (3))/0 _ 


The iteration is said to be converged if either conditions 1 and 3 or conditions 2 and 3 are satisfied. 
If p=0, then condition 3 will be automatically satisfied. The iteration is said to be not converged if 
neither pair of conditions is satisfied within the maximum number of iterations. 


Algorithm 


The iteration process uses the following steps: 


1. Calculate $m_{ij}^{(0)}$, $\theta_{jk}^{(0)}$, and $\eta_{ij}^{(0)}$.

2. Set s=0.

3. Calculate $H(\beta^{(s)})$ evaluated at $m_{ij} = m_{ij}^{(s)}$; calculate $q(\beta^{(s)})$ evaluated at $\eta_{ij} = \eta_{ij}^{(s)}$.

4. Solve for $\beta^{(s+1)}$.

5. Calculate $\nu_{ij}^{(s+1)} = \sum_{k=1}^{p} x_{ijk}\beta_k^{(s+1)}$ and

$$m_{ij}^{(s+1)} = \begin{cases} N_j\, z_{ij}\, e^{\nu_{ij}^{(s+1)}} \Big/ \sum_{i'=1}^{r} z_{i'j}\, e^{\nu_{i'j}^{(s+1)}} & \text{if } z_{ij} > 0 \\ 0 & \text{if } z_{ij} \le 0 \end{cases}$$

6. Check whether the stopping criteria are satisfied. If yes, stop iteration and declare convergence. Otherwise continue.

7. Increase s by 1 and check whether the maximum iteration has been reached. If yes, stop iteration and declare the process not converged. Otherwise repeat steps 3-7.
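
The iteration above can be sketched in a few lines of standard-library Python. This is a simplified illustration, not the procedure's implementation: the function names `solve` and `fit_multinomial_loglinear` are hypothetical, only the relative-change stopping condition is checked, and $\theta_{jk}$ is computed as a weighted mean using the current $m_{ij}$ (an assumption that matters only in the first iteration when $\Delta > 0$).

```python
import math


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [u - f * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_multinomial_loglinear(n, z, x, delta=0.0, eps=1e-10, max_iter=50):
    """n[j][i]: counts, z[j][i]: cell structure values, x[j][i][k]: design."""
    c, r, p = len(n), len(n[0]), len(x[0][0])
    N = [sum(row) for row in n]
    # Initial expected counts: n + delta for cells that are not structural zeros.
    m = [[n[j][i] + delta if z[j][i] > 0 else 0.0 for i in range(r)]
         for j in range(c)]
    beta = [0.0] * p
    for _ in range(max_iter):
        H = [[0.0] * p for _ in range(p)]
        q = [0.0] * p
        for j in range(c):
            tot = sum(m[j])
            theta = [sum(m[j][i] * x[j][i][k] for i in range(r)) / tot
                     for k in range(p)]
            for i in range(r):
                if z[j][i] <= 0 or m[j][i] <= 0:
                    continue  # eta is 0 for these cells
                eta = (math.log(m[j][i] / z[j][i])
                       + (n[j][i] - m[j][i]) / m[j][i])
                d = [x[j][i][k] - theta[k] for k in range(p)]
                for k in range(p):
                    q[k] += m[j][i] * eta * d[k]
                    for l in range(p):
                        H[k][l] += m[j][i] * d[k] * d[l]
        beta = solve(H, q)  # H(beta_s) beta_{s+1} = q(beta_s)
        new_m = []
        for j in range(c):
            w = [z[j][i] * math.exp(sum(x[j][i][k] * beta[k] for k in range(p)))
                 if z[j][i] > 0 else 0.0 for i in range(r)]
            new_m.append([N[j] * wi / sum(w) for wi in w])
        change = max(abs(new_m[j][i] - m[j][i]) / m[j][i]
                     for j in range(c) for i in range(r) if m[j][i] > 0)
        m = new_m
        if change < eps:
            return beta, m, True
    return beta, m, False


# Logit model with r = 2 responses, c = 2 settings, and one parameter
# (a response main effect only, so the response does not depend on A).
n = [[30, 10], [20, 20]]
z = [[1.0, 1.0], [1.0, 1.0]]
x = [[[1.0], [0.0]], [[1.0], [0.0]]]
beta, m, converged = fit_multinomial_loglinear(n, z, x)
print(beta, m, converged)
```

In this example the fitted counts are the same in both settings (25 and 15), because the single parameter forces a common response distribution across the levels of A.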


Estimated Normalizing Constants 


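The maximum-likelihood estimate for $\alpha_j$ is

$$\hat\alpha_j = \log\left( \frac{N_j}{\sum_{i=1}^{r} z_{ij}\, e^{\hat\nu_{ij}}} \right), \quad j = 1, \ldots, c$$

where

$$\hat\nu_{ij} = \sum_{k=1}^{p} x_{ijk}\hat\beta_k$$

Estimated Cell Counts

The estimated expected count is

$$\hat m_{ij} = \begin{cases} N_j\, z_{ij}\, e^{\hat\nu_{ij}} \Big/ \sum_{i'=1}^{r} z_{i'j}\, e^{\hat\nu_{i'j}} & \text{if } z_{ij} > 0 \\ 0 & \text{if } z_{ij} \le 0 \end{cases}$$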


Goodness-of-Fit Statistics 


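The Pearson chi-square statistic is

$$X^2 = \sum_{j=1}^{c}\sum_{i=1}^{r} X_{ij}^2$$

where

$$X_{ij}^2 = \begin{cases} (n_{ij} - \hat m_{ij})^2/\hat m_{ij} & \text{if } z_{ij} > 0,\ n_{ij} > 0, \text{ and } \hat m_{ij} > 0 \\ \text{SYSMIS} & \text{if } z_{ij} > 0,\ n_{ij} > 0, \text{ and } \hat m_{ij} = 0 \\ 0 & \text{if } z_{ij} \le 0 \text{ or } n_{ij} = \hat m_{ij} \end{cases}$$

If any $X_{ij}^2$ is system missing, then $X^2$ is also system missing.

The likelihood-ratio chi-square statistic is

$$G^2 = 2\sum_{j=1}^{c}\sum_{i=1}^{r} G_{ij}^2$$

where

$$G_{ij}^2 = \begin{cases} n_{ij}\log(n_{ij}/\hat m_{ij}) & \text{if } z_{ij} > 0,\ n_{ij} > 0, \text{ and } \hat m_{ij} > 0 \\ \text{SYSMIS} & \text{if } z_{ij} > 0,\ n_{ij} > 0, \text{ and } \hat m_{ij} = 0 \\ 0 & \text{if } z_{ij} > 0,\ n_{ij} = 0, \text{ and } \hat m_{ij} > 0; \text{ or } z_{ij} \le 0 \text{ or } n_{ij} = \hat m_{ij} \end{cases}$$

If any $G_{ij}^2$ is system missing, then $G^2$ is also system missing.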
Degrees of Freedom 
The degrees of freedom for each statistic is defined as $a = c(r - 1) - p - E$, where E is the number of cells with $z_{ij} \le 0$ or $\hat m_{ij} = 0$.

Significance Level

The significance level (or the p value) for the Pearson chi-square statistic is $\operatorname{Prob}(\chi_a^2 > X^2)$ and that for the likelihood-ratio chi-square statistic is $\operatorname{Prob}(\chi_a^2 > G^2)$. In both cases, $\chi_a^2$ is the central chi-square distribution with $a$ degrees of freedom.
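
The two goodness-of-fit statistics and their degrees of freedom can be computed cell by cell as in the following sketch. The helper names are hypothetical, `None` stands in for SYSMIS, and a cell with $n_{ij} = 0$ and $\hat m_{ij} > 0$ is given the usual Pearson contribution (an assumption). The p value line relies on the identity $\operatorname{Prob}(\chi_1^2 > w) = \operatorname{erfc}(\sqrt{w/2})$, so it is valid only when the degrees of freedom equal 1.

```python
import math


def pearson_cell(n, m, z):
    if z <= 0 or n == m:
        return 0.0
    if m == 0:
        return None  # SYSMIS
    return (n - m) ** 2 / m


def lr_cell(n, m, z):
    if z <= 0 or n == m or n == 0:
        return 0.0
    if m == 0:
        return None  # SYSMIS
    return n * math.log(n / m)


def total(cells, scale=1.0):
    """Sum of cell contributions; missing if any contribution is missing."""
    if any(v is None for v in cells):
        return None
    return scale * sum(cells)


n = [[30, 10], [20, 20]]              # observed counts n[j][i]
m_hat = [[25.0, 15.0], [25.0, 15.0]]  # fitted counts from a one-parameter logit model
z = [[1.0, 1.0], [1.0, 1.0]]

cells = [(n[j][i], m_hat[j][i], z[j][i]) for j in range(2) for i in range(2)]
x2 = total([pearson_cell(*cell) for cell in cells])
g2 = total([lr_cell(*cell) for cell in cells], scale=2.0)

c_levels, r_levels, p, E = 2, 2, 1, 0
df = c_levels * (r_levels - 1) - p - E

p_value_x2 = math.erfc(math.sqrt(x2 / 2.0))  # upper tail, df = 1 only
print(x2, g2, df, p_value_x2)
```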
Analysis of Dispersion (Logit Models Only)


The analysis of dispersion is based on two types of dispersion: entropy and concentration. The following definitions are used:


S(A)             Dispersion due to the model
S(B|A)           Dispersion due to residuals
S(B)             Total dispersion
R = S(A)/S(B)    Measure of association

where S(A) + S(B|A) = S(B). Also define


The bounds are 0 < 7; < 


: ty;log(wy;) if O< 7 
Sij(B|A) = e yi log (a3) : 


Concentration 


D 
wy 
ll 
< 
Re 
| 
43 
>> 
wo 
Se 


pe 
S 
lI 
Me 
a 
a 
| 
% 
Pac 
VE 


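Entropy

$$S(B) = -N\sum_{i=1}^{r} S_i(B)$$

where

$$S_i(B) = \begin{cases} \hat\pi_i\log(\hat\pi_i) & \text{if } \hat\pi_i > 0 \\ 0 & \text{if } \hat\pi_i = 0 \end{cases}$$

and

$$S(B|A) = -\sum_{j=1}^{c} N_j \sum_{i=1}^{r} S_{ij}(B|A)$$

where

$$S_{ij}(B|A) = \begin{cases} \hat\pi_{ij}\log(\hat\pi_{ij}) & \text{if } \hat\pi_{ij} > 0 \\ 0 & \text{if } \hat\pi_{ij} = 0 \end{cases}$$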
Degrees of Freedom


Source      Measure    Degrees of Freedom
Model       S(A)       f(r-1)
Residual    S(B|A)     (N-f-1)(r-1)
Total       S(B)       (N-1)(r-1)


where f equals p minus the number of nonredundant columns (in the design matrix) associated with the main effects of the dependent factors.


Residuals 


Goodness-of-fit statistics provide only broad summaries of how models fit data. The pattern of 
lack of fit is revealed in cell-by-cell comparisons of observed and fitted cell counts. 


Simple Residuals 


The simple residual of the (i,j)th cell is

$$r_{ij} = \begin{cases} n_{ij} - \hat m_{ij} & \text{if } z_{ij} > 0 \\ \text{SYSMIS} & \text{if } z_{ij} \le 0 \end{cases}$$


Standardized Residuals 


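The standardized residual for the (i,j)th cell is

$$r_{ij}^{S} = \begin{cases} (n_{ij} - \hat m_{ij}) \Big/ \sqrt{\hat m_{ij}(1 - \hat m_{ij}/N_j)} & \text{if } z_{ij} > 0 \text{ and } 0 < \hat m_{ij} < N_j \\ 0 & \text{if } z_{ij} > 0 \text{ and } n_{ij} = \hat m_{ij} \\ \text{SYSMIS} & \text{otherwise} \end{cases}$$

The standardized residuals are also known as Pearson residuals even though $\sum_{j=1}^{c}\sum_{i=1}^{r} (r_{ij}^{S})^2 \ne X^2$. Although the standardized residuals are asymptotically normal, their asymptotic variances are less than 1.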


Adjusted Residuals 


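The adjusted residual is the simple residual divided by its estimated standard error. Its definition and applications first appeared in Haberman (1973) and re-appeared on page 454 of Haberman (1979). This statistic for the (i,j)th cell is

$$r_{ij}^{A} = \begin{cases} (n_{ij} - \hat m_{ij}) \Big/ \sqrt{\widehat{\operatorname{var}}(n_{ij} - \hat m_{ij})} & \text{if } z_{ij} > 0 \text{ and } \hat m_{ij} > 0 \\ 0 & \text{if } z_{ij} > 0 \text{ and } n_{ij} = \hat m_{ij} \\ \text{SYSMIS} & \text{otherwise} \end{cases}$$

where

$$\widehat{\operatorname{var}}(n_{ij} - \hat m_{ij}) = \hat m_{ij}\left( 1 - \hat m_{ij}/N_j - \hat m_{ij}\sum_{k=1}^{p}\sum_{l=1}^{p} (x_{ijk} - \hat\theta_{jk})(x_{ijl} - \hat\theta_{jl})\, h^{kl} \right)$$

and $h^{kl}$ is the (k,l)th element of $H^{-1}(\hat\beta)$. The adjusted residuals are asymptotically standard normal.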


Deviance Residuals 


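Pierce and Schafer (1986) and McCullagh and Nelder (1989) define the signed square root of the individual contribution to the $G^2$ statistic as the deviance residual. This statistic for the (i,j)th cell is

$$r_{ij}^{D} = \operatorname{sign}(n_{ij} - \hat m_{ij})\sqrt{d_{ij}}$$

where

$$d_{ij} = \begin{cases} 2\left( n_{ij}\log(n_{ij}/\hat m_{ij}) - (n_{ij} - \hat m_{ij}) \right) & \text{if } z_{ij} > 0,\ \hat m_{ij} > 0, \text{ and } n_{ij} > 0 \\ 2\hat m_{ij} & \text{if } z_{ij} > 0,\ \hat m_{ij} \ge 0, \text{ and } n_{ij} = 0 \\ 0 & \text{if } z_{ij} > 0 \text{ and } n_{ij} = \hat m_{ij} \\ \text{SYSMIS} & \text{otherwise} \end{cases}$$

For multinomial sampling, the individual contribution to the $G^2$ statistic is only $2n_{ij}\log(n_{ij}/\hat m_{ij})$, but this is negative when $n_{ij} < \hat m_{ij}$. Thus, an extra term $-2(n_{ij} - \hat m_{ij})$ is added to it so that $d_{ij} \ge 0$ for all i and j. However, we still have $\sum_{j=1}^{c}\sum_{i=1}^{r} (r_{ij}^{D})^2 = G^2$, because the extra terms sum to zero within each setting of A.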
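
The claim that the squared deviance residuals still add up to $G^2$ can be checked numerically. The sketch below uses a hypothetical helper `deviance_residual`, returns `None` for SYSMIS, and reuses illustrative observed and fitted counts whose row totals agree.

```python
import math


def deviance_residual(n, m, z):
    """Signed square root of the cell's deviance contribution."""
    if z <= 0:
        return None  # SYSMIS
    if n == m:
        return 0.0
    if m > 0 and n > 0:
        d = 2.0 * (n * math.log(n / m) - (n - m))
    elif n == 0 and m >= 0:
        d = 2.0 * m
    else:
        return None  # SYSMIS
    # Guard against tiny negative values caused by rounding.
    return math.copysign(math.sqrt(max(d, 0.0)), n - m)


n = [[30, 10], [20, 20]]
m_hat = [[25.0, 15.0], [25.0, 15.0]]  # fitted counts; each row total equals N_j

residuals = [deviance_residual(n[j][i], m_hat[j][i], 1.0)
             for j in range(2) for i in range(2)]
sum_sq = sum(rd * rd for rd in residuals)

g2 = 2.0 * sum(n[j][i] * math.log(n[j][i] / m_hat[j][i])
               for j in range(2) for i in range(2))
print(residuals, sum_sq, g2)
```

The two totals agree because the added terms $-2(n_{ij} - \hat m_{ij})$ cancel when the fitted counts reproduce each marginal total $N_j$.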


Generalized Residual 


Consider a linear combination of the cell counts $\sum_{j=1}^{c}\sum_{i=1}^{r} d_{ij}n_{ij}$, where $d_{ij}$ are real numbers.

The estimated expected value is

$$\sum_{j=1}^{c}\sum_{i=1}^{r} d_{ij}\hat m_{ij}$$

The simple residual for this linear combination is

$$\sum_{j=1}^{c}\sum_{i=1}^{r} d_{ij}(n_{ij} - \hat m_{ij})$$

The standardized residual for this linear combination is

$$\frac{\sum_{j=1}^{c}\sum_{i=1}^{r} d_{ij}(n_{ij} - \hat m_{ij})}{\sqrt{\sum_{j=1}^{c}\left( \sum_{i=1}^{r} d_{ij}^2\hat m_{ij} - \left(\sum_{i=1}^{r} d_{ij}\hat m_{ij}\right)^2 \Big/ N_j \right)}}$$

The adjusted residual for this linear combination is, as given on page 420 of Haberman (1979),

$$\frac{\sum_{j=1}^{c}\sum_{i=1}^{r} d_{ij}(n_{ij} - \hat m_{ij})}{\sqrt{V}}$$

where

$$V = \sum_{j=1}^{c}\sum_{i=1}^{r} d_{ij}^2\hat m_{ij} - \sum_{j=1}^{c}\frac{1}{N_j}\left(\sum_{i=1}^{r} d_{ij}\hat m_{ij}\right)^2 - \sum_{k=1}^{p}\sum_{l=1}^{p} f_k f_l h^{kl}$$

$$f_k = \sum_{j=1}^{c}\sum_{i=1}^{r} d_{ij}\hat m_{ij}(x_{ijk} - \hat\theta_{jk})$$

Generalized Log-Odds Ratio 


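Consider a linear combination of the natural logarithm of cell counts

$$\sum_{j=1}^{c}\sum_{i=1}^{r} d_{ij}\log(m_{ij})$$

where $d_{ij}$ are real numbers with the restriction

$$\sum_{i=1}^{r} d_{ij} = 0, \quad j = 1, \ldots, c$$

The linear combination is estimated by

$$\sum_{j=1}^{c}\sum_{i=1}^{r} d_{ij}\log(\hat m_{ij}) = \sum_{j=1}^{c}\sum_{i=1}^{r} d_{ij}\log(z_{ij}) + \sum_{j=1}^{c}\sum_{i=1}^{r}\sum_{k=1}^{p} d_{ij}x_{ijk}\hat\beta_k$$

The variance of the estimate is

$$\operatorname{var}\left( \sum_{j=1}^{c}\sum_{i=1}^{r} d_{ij}\log(\hat m_{ij}) \right) = \sum_{k=1}^{p}\sum_{l=1}^{p} w_k w_l h^{kl}$$

where

$$w_k = \sum_{j=1}^{c}\sum_{i=1}^{r} d_{ij}x_{ijk}, \quad k = 1, \ldots, p$$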

Wald Statistic 

The null hypothesis is

$$H_0: \sum_{j=1}^{c}\sum_{i=1}^{r} d_{ij}\log(m_{ij}) = 0$$

The Wald statistic is

$$W = \frac{\left( \sum_{j=1}^{c}\sum_{i=1}^{r} d_{ij}\log(\hat m_{ij}) \right)^2}{\sum_{k=1}^{p}\sum_{l=1}^{p} w_k w_l h^{kl}}$$

Under $H_0$, W is asymptotically distributed as chi-square with 1 degree of freedom. The significance level is $\operatorname{Prob}(\chi_1^2 > W)$. Note: W will be system missing if the variance of the estimate is 0.




Asymptotic Confidence Interval 


The asymptotic $(1 - \alpha) \times 100\%$ confidence interval is

$$\sum_{j=1}^{c}\sum_{i=1}^{r} d_{ij}\log(\hat m_{ij}) \pm z_{\alpha/2}\sqrt{\sum_{k=1}^{p}\sum_{l=1}^{p} w_k w_l h^{kl}}$$

where $z_{\alpha/2}$ is the upper $\alpha/2$ point of the standard normal distribution. The default value of $\alpha$ is 0.05.
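
As a worked illustration of the generalized log-odds ratio, its Wald statistic, and the confidence interval, the sketch below treats a saturated 2 x 2 logit model, for which the estimate reduces to the ordinary log odds ratio. The counts, design, and coefficients $d_{ij}$ are illustrative, the 2 x 2 matrix inverse is written out by hand, and the p value uses $\operatorname{Prob}(\chi_1^2 > W) = \operatorname{erfc}(\sqrt{W/2})$.

```python
import math

# Fitted counts m_hat[j][i]; the model is saturated, so they equal the observed counts.
m_hat = [[30.0, 10.0], [20.0, 20.0]]
# Design x[j][i][k]: k = 0 is the response main effect, k = 1 its interaction with A.
x = [[[1.0, 1.0], [0.0, 0.0]],
     [[1.0, 0.0], [0.0, 0.0]]]
# Coefficients d[j][i]; they sum to zero over i within each setting j.
d = [[1.0, -1.0], [-1.0, 1.0]]
p = 2

# Information matrix H with elements sum m (x_k - theta_k)(x_l - theta_l).
H = [[0.0] * p for _ in range(p)]
for mj, xj in zip(m_hat, x):
    Nj = sum(mj)
    theta = [sum(m * xi[k] for m, xi in zip(mj, xj)) / Nj for k in range(p)]
    for m, xi in zip(mj, xj):
        for k in range(p):
            for l in range(p):
                H[k][l] += m * (xi[k] - theta[k]) * (xi[l] - theta[l])

det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
H_inv = [[H[1][1] / det, -H[0][1] / det],
         [-H[1][0] / det, H[0][0] / det]]

estimate = sum(dji * math.log(m)
               for dj, mj in zip(d, m_hat) for dji, m in zip(dj, mj))
w = [sum(dji * xi[k] for dj, xj in zip(d, x) for dji, xi in zip(dj, xj))
     for k in range(p)]
variance = sum(w[k] * w[l] * H_inv[k][l] for k in range(p) for l in range(p))

wald = estimate ** 2 / variance if variance > 0 else None  # None stands for SYSMIS
p_value = math.erfc(math.sqrt(wald / 2.0))

z_975 = 1.959963984540054  # upper 0.025 point of the standard normal
half_width = z_975 * math.sqrt(variance)
ci = (estimate - half_width, estimate + half_width)
print(estimate, variance, wald, p_value, ci)
```

In this saturated case the variance equals the familiar sum of reciprocal cell counts, 1/30 + 1/10 + 1/20 + 1/20, which gives a quick check on the general formula.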


Aggregated Data 


This section shows how data are aggregated for a multinomial distribution. The following 
notation is used in this section: 


Vij Number of cases for B=7 (2 =1,...,r)and A=j (Jj =1,...,c) 
Nijs sth caseweight for B =i and A = j(s =1,...,v;) 

Lije Covariate 

Zijs Cell weight 

Cijs GRESID coefficient 

Cijs GLOR coefficient 

ut Number of positive z;;. (cell weights) for 1 < s < uj; 


yt nt iful. > 

=, J Meseey Mize. Tay > 0 

Nij = ‘fide xs Se 
0 Ivy; =Oorv;, =0 


nt nijs ifnij, > Oand z;;, > 0 

eg = 3 Spe esas : 

‘J 0 ifnijs < 0 and z;;, > 0 

and )}~,<,,,, Means summation over the range of s with the terms z;;, > 0. 
. *3 oy 


The cell weight value is 


y* oy Ines J ts TS 

Li <s<uijMijsrijs/Nij ifn;; > 0 and vj; > 0 

yy - aE i — TS 
acd ea eue tere) Ue ifn;; =O and v;; > 0 
7 . Pp 

0 if Vin =0 

1 ifv;; =0 


If no variable is specified as the cell weight variable, then all cases have unit cell weights by 
default. 


The cell covariate value is 
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1<s<vij MijsCijs/N ifn;; > Oandv,;; > 0 


y 

ijs Cf) 

ey tec y* : ae i — o> 

Vig = 4 Ucs<vj;2 ijs/ Vig ifn;; =O andi i; > 0 
0 if v;, = Oorv;; =0 


The cell GRESID coefficient is 


ye Ho Het fae. Ak apse os as 
MNcs<vj;MijsCijs/ ij if nij > 0 and ij > O 
es OD ; + ifn;; = sts 
Cig = 4 Vics<n,,Cijs/ Vij ifn;; =Oandv;; > 0 
0 if v;; or vij = 0 


There are no defaults for the GRESID coefficients. 


The cell GLOR coefficient is 


vi<s<vi;Mijstijs/Nij ifn;; > Oandv,; > 0 
x —_ yt 2 / 4 : : = fs 4 ee 
Cig = § Li <s<vi,Cijs/Vij ifn;; =Oandv;; > 0 

0 if vii =0orv;; =0 


There are no defaults for the GLOR coefficients. 
GENLOG Poisson Loglinear Model Algorithms


This chapter describes the algorithm to calculate maximum-likelihood estimates for the Poisson 
loglinear model. This algorithm is applicable only to aggregated data. See “Aggregated Data” 
for producing aggregated data. 


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


$B$          Generic categorical dependent (response) variable. Its categories are indexed by an array of integers.

$r$          Number of categories of B. $r \ge 1$

$p$          Number of nonredundant (nonaliased) parameters.

$i$          Generic index for the category of B. $i = 1, \ldots, r$

$k$          Generic index for the parameters. $k = 0, \ldots, p$

$n_i$        Observed count in the ith response of B. $n_i \ge 0$

$N$          Total observed count, equal to $\sum_{i=1}^{r} n_i$. $N > 0$

$m_i$        Expected count. $m_i \ge 0$

$z_i$        Cell structure value.

$\beta_k$    The kth nonredundant parameter.

$\beta$      Vector of $(\beta_0, \beta_1, \ldots, \beta_p)'$.

$x_{ik}$     An element in the ith row and the kth column of the design matrix.


■ Because of the Poisson distribution assumptions, the logit model is not applicable for a Poisson distribution.

■ The Poisson distribution is available in GENLOG only.


Model 


There are two components in a loglinear model: the random component and the systematic 
component. 


Random Component 


The random component describes the joint distribution of the counts.

■ The count $n_i$ has a Poisson distribution with parameter $m_i$.

■ The counts $n_i$ and $n_{i'}$ are independent if $i \ne i'$.

■ The joint probability distribution of $\{n_i\}$ is the product of these r independent Poisson distributions. The probability density function is

$$\prod_{i=1}^{r} \frac{m_i^{n_i}\, e^{-m_i}}{n_i!}$$

■ The expected count is $E(n_i) = m_i$.

■ The covariance is

$$\operatorname{cov}(n_i, n_{i'}) = \begin{cases} m_i & \text{if } i = i' \\ 0 & \text{if } i \ne i' \end{cases}$$

Systematic Component 


The systematic component describes the linkage function between the expected counts and the parameters. The expected counts are themselves functions of parameters. For $i = 1, \ldots, r$,

$$m_i = \begin{cases} z_i\, e^{\beta_0 + \nu_i} & \text{if } z_i > 0 \\ 0 & \text{if } z_i \le 0 \end{cases}$$

where

$$\nu_i = \sum_{k=1}^{p} x_{ik}\beta_k$$

Since there are no constraints on the observed counts, $\beta_0$ is a free parameter in a Poisson loglinear model.


Cell Structure Values 


Cell structure values play two roles in loglinear procedures, depending on their signs. If $z_i > 0$, it is a usual weight for the corresponding cell and $\log(z_i)$ is sometimes called the offset. If $z_i \le 0$, a structural zero is imposed on the cell (B=i). Contingency tables containing at least one structural zero are called incomplete tables. If $n_i = 0$ but $z_i > 0$, the cell (B=i) contains a sampling zero. Although a structural zero is still considered part of the contingency table, it is not used in fitting the model. Cellwise statistics are not computed for structural zeros.


Maximum-Likelihood Estimation 


The Poisson log-likelihood is

$$L(\beta) = L(\beta_0, \ldots, \beta_p) = \text{constant} + \sum_{i=1}^{r} \left( n_i\log(m_i) - m_i \right)$$


Likelihood Equations 


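It can be shown that

$$\frac{\partial L}{\partial\beta_0} = \sum_{i=1}^{r} (n_i - m_i)$$

$$\frac{\partial L}{\partial\beta_k} = \sum_{i=1}^{r} (n_i - m_i)\, x_{ik}, \quad k = 1, \ldots, p$$

Let $g(\beta) = (g_0(\beta), \ldots, g_p(\beta))'$ be the $(p+1) \times 1$ gradient vector with

$$g_k(\beta) = \frac{\partial L}{\partial\beta_k}$$

The maximum-likelihood estimates $\hat\beta = (\hat\beta_0, \ldots, \hat\beta_p)'$ are regarded as a solution to the vector of likelihood equations:

$$g(\hat\beta) = 0$$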


Hessian Matrix 


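The likelihood equations are nonlinear functions of $\beta$. Solving them for $\hat\beta$ requires an iterative method. The Newton-Raphson method is used. It can be shown that

$$\frac{\partial^2 L}{\partial\beta_0^2} = -\sum_{i=1}^{r} m_i$$

$$\frac{\partial^2 L}{\partial\beta_0\,\partial\beta_l} = -\sum_{i=1}^{r} m_i x_{il}$$

$$\frac{\partial^2 L}{\partial\beta_k\,\partial\beta_0} = -\sum_{i=1}^{r} m_i x_{ik}$$

$$\frac{\partial^2 L}{\partial\beta_k\,\partial\beta_l} = -\sum_{i=1}^{r} m_i x_{ik} x_{il}$$

Let $H(\beta)$ be the $(p+1) \times (p+1)$ information matrix, where $-H(\beta)$ is the Hessian matrix of the log-likelihood. The elements of $H(\beta)$ are

$$h_{kl}(\beta) = -\frac{\partial^2 L}{\partial\beta_k\,\partial\beta_l}, \quad k = 0, \ldots, p \text{ and } l = 0, \ldots, p$$

Note: $H(\hat\beta)$ is a symmetric positive definite matrix. The asymptotic covariance matrix of $\hat\beta$ is estimated by $H^{-1}(\hat\beta)$.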


Newton-Raphson Method 


Let $\beta^{(s)}$ denote the sth approximation for the solution to the vector of likelihood equations. By the Newton-Raphson method,

$$\beta^{(s+1)} = \beta^{(s)} + H^{-1}(\beta^{(s)})\, g(\beta^{(s)})$$

Define $q(\beta) = H(\beta)\beta + g(\beta)$. The kth element of $q(\beta)$ is

$$q_k(\beta) = \sum_{i=1}^{r} m_i\,\eta_i\, x_{ik}$$

where

$$\eta_i = \begin{cases} \log(m_i/z_i) + (n_i - m_i)/m_i & \text{if } z_i > 0 \text{ and } m_i > 0 \\ 0 & \text{otherwise} \end{cases}$$

Then

$$H(\beta^{(s)})\,\beta^{(s+1)} = q(\beta^{(s)})$$

Thus, given $\beta^{(s)}$, the (s+1)th approximation $\beta^{(s+1)}$ is found by solving this system of equations.


Initial Values 


$\beta^{(0)}$, which corresponds to a saturated model, is used as the initial value for $\beta$. Then the initial estimates for the expected cell counts are

$$m_i^{(0)} = \begin{cases} n_i + \Delta & \text{if } z_i > 0 \\ 0 & \text{if } z_i \le 0 \end{cases}$$

where $\Delta \ge 0$ is a constant.

Note: For saturated models, $\Delta$ is added to $n_i$ if $z_i > 0$. This is done to avoid numerical problems in case some observed counts are 0. We advise users to set $\Delta$ to 0 whenever all observed counts (other than structural zeros) are positive.

The initial values for $\eta_i$ are

$$\eta_i^{(0)} = \begin{cases} \log\left(m_i^{(0)}/z_i\right) + \left(n_i - m_i^{(0)}\right)/m_i^{(0)} & \text{if } z_i > 0 \text{ and } m_i^{(0)} > 0 \\ 0 & \text{otherwise} \end{cases}$$


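Stopping Criteria

The following conditions are checked for convergence:

1. $\max_i\left( \left| m_i^{(s+1)} - m_i^{(s)} \right| / m_i^{(s)} \right) < \epsilon$ provided that $m_i^{(s)} > 0$

2. $\max_i\left( \left| m_i^{(s+1)} - m_i^{(s)} \right| \right) < \epsilon$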


> (eaa@yr< 


The iteration is said to be converged if either conditions 1 and 3 or conditions 2 and 3 are satisfied. 
If p=0, then condition 3 will be automatically satisfied. The iteration is said to be not converged if 
neither pair of conditions is satisfied within the maximum number of iterations. 


Algorithm 


The iteration process uses the following steps: 


1. Calculate $m_i^{(0)}$ and $\eta_i^{(0)}$.

2. Set s=0.

3. Calculate $H(\beta^{(s)})$ evaluated at $m_i = m_i^{(s)}$; calculate $q(\beta^{(s)})$ evaluated at $\eta_i = \eta_i^{(s)}$.

4. Solve for $\beta^{(s+1)}$.

5. Calculate $\nu_i^{(s+1)} = \sum_{k=1}^{p} x_{ik}\beta_k^{(s+1)}$ and

6. $$m_i^{(s+1)} = \begin{cases} z_i\, e^{\beta_0^{(s+1)} + \nu_i^{(s+1)}} & \text{if } z_i > 0 \\ 0 & \text{if } z_i \le 0 \end{cases}$$

7. Check whether the stopping criteria are satisfied. If yes, stop iteration and declare convergence. Otherwise continue.

8. Increase s by 1 and check whether the maximum iteration has been reached. If yes, stop iteration and declare the process not converged. Otherwise repeat steps 3-7.
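
A compact standard-library sketch of this iteration for the Poisson case follows. It is illustrative only: `solve` and `fit_poisson_loglinear` are hypothetical names, a constant column is prepended to the design to carry $\beta_0$, and only the relative-change stopping condition is checked.

```python
import math


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [u - f * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_poisson_loglinear(n, z, x, delta=0.5, eps=1e-10, max_iter=50):
    """n[i]: counts, z[i]: cell structure values, x[i][k]: design without constant."""
    r = len(n)
    xs = [[1.0] + list(row) for row in x]  # column 0 carries beta_0
    q_len = len(xs[0])
    m = [n[i] + delta if z[i] > 0 else 0.0 for i in range(r)]
    beta = [0.0] * q_len
    for _ in range(max_iter):
        H = [[0.0] * q_len for _ in range(q_len)]
        q = [0.0] * q_len
        for i in range(r):
            if z[i] <= 0 or m[i] <= 0:
                continue  # eta is 0 for these cells
            eta = math.log(m[i] / z[i]) + (n[i] - m[i]) / m[i]
            for k in range(q_len):
                q[k] += m[i] * eta * xs[i][k]
                for l in range(q_len):
                    H[k][l] += m[i] * xs[i][k] * xs[i][l]
        beta = solve(H, q)  # H(beta_s) beta_{s+1} = q(beta_s)
        new_m = [z[i] * math.exp(sum(b * v for b, v in zip(beta, xs[i])))
                 if z[i] > 0 else 0.0 for i in range(r)]
        change = max(abs(a - b) / b for a, b in zip(new_m, m) if b > 0)
        m = new_m
        if change < eps:
            return beta, m, True
    return beta, m, False


# Four cells of a 2 x 2 table in the order (1,1), (1,2), (2,1), (2,2),
# fitted with an independence model: row-1 and column-1 indicators.
n = [30, 10, 20, 20]
z = [1.0, 1.0, 1.0, 1.0]
x = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
beta, m, converged = fit_poisson_loglinear(n, z, x)
print(beta, m, converged)
```

The fitted counts reproduce the usual independence estimates (row total times column total divided by the grand total), which is a convenient check on the iteration.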

Estimated Cell Counts 


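The estimated expected count is

$$\hat m_i = \begin{cases} z_i\, e^{\hat\beta_0 + \hat\nu_i} & \text{if } z_i > 0 \\ 0 & \text{if } z_i \le 0 \end{cases}$$

where

$$\hat\nu_i = \sum_{k=1}^{p} x_{ik}\hat\beta_k$$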


Goodness-of-Fit Statistics 


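The Pearson chi-square statistic is

$$X^2 = \sum_{i=1}^{r} X_i^2$$

where

$$X_i^2 = \begin{cases} (n_i - \hat m_i)^2/\hat m_i & \text{if } z_i > 0,\ n_i > 0, \text{ and } \hat m_i > 0 \\ \text{SYSMIS} & \text{if } z_i > 0,\ n_i > 0, \text{ and } \hat m_i = 0 \\ 0 & \text{if } z_i \le 0 \text{ or } n_i = \hat m_i \end{cases}$$

If any $X_i^2$ is system missing, then $X^2$ is also system missing.

The likelihood-ratio chi-square statistic is

$$G^2 = 2\sum_{i=1}^{r} G_i^2$$

where

$$G_i^2 = \begin{cases} n_i\log(n_i/\hat m_i) - (n_i - \hat m_i) & \text{if } z_i > 0,\ n_i > 0, \text{ and } \hat m_i > 0 \\ \text{SYSMIS} & \text{if } z_i > 0,\ n_i > 0, \text{ and } \hat m_i = 0 \\ \hat m_i & \text{if } z_i > 0,\ n_i = 0, \text{ and } \hat m_i > 0 \\ 0 & \text{if } z_i \le 0 \text{ or } n_i = \hat m_i \end{cases}$$

If any $G_i^2$ is system missing, then $G^2$ is also system missing.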


Degrees of Freedom 


The degrees of freedom for each statistic is defined as $a = r - 1 - p - E$, where E is the number of cells with $z_i \le 0$ or $\hat m_i = 0$.


Significance Level


The significance level (or the p value) for the Pearson chi-square statistic is $\operatorname{Prob}(\chi_a^2 > X^2)$ and that for the likelihood-ratio chi-square statistic is $\operatorname{Prob}(\chi_a^2 > G^2)$. In both cases, $\chi_a^2$ is the central chi-square distribution with $a$ degrees of freedom.


Residuals 


Goodness-of-fit statistics provide only broad summaries of how models fit data. The pattern of 
lack of fit is revealed in cell-by-cell comparisons of observed and fitted cell counts. 


Simple Residuals 


The simple residual of the ith cell is

$$r_i = \begin{cases} n_i - \hat m_i & \text{if } z_i > 0 \\ \text{SYSMIS} & \text{if } z_i \le 0 \end{cases}$$


Standardized Residuals 
The standardized residual for the ith cell is

$$r_i^{S} = \begin{cases} (n_i - \hat m_i)/\sqrt{\hat m_i} & \text{if } z_i > 0 \text{ and } 0 < \hat m_i \\ 0 & \text{if } z_i > 0 \text{ and } n_i = \hat m_i \\ \text{SYSMIS} & \text{otherwise} \end{cases}$$

The standardized residuals are also known as Pearson residuals because $\sum_{i=1}^{r} (r_i^{S})^2 = X^2$ when all $z_i > 0$. Although the standardized residuals are asymptotically normal, their asymptotic variances are less than 1.


Adjusted Residuals 


The adjusted residual is the simple residual divided by its estimated standard error. This statistic for the ith cell is

$$r_i^{A} = \begin{cases} (n_i - \hat m_i) \Big/ \sqrt{\hat m_i(1 - a_{ii})} & \text{if } z_i > 0,\ n_i \ne \hat m_i, \text{ and } \hat m_i > 0 \\ 0 & \text{if } z_i > 0 \text{ and } n_i = \hat m_i \\ \text{SYSMIS} & \text{otherwise} \end{cases}$$

where

$$a_{ii} = \hat m_i\left( h^{00} + 2\sum_{k=1}^{p} x_{ik}h^{0k} + \sum_{k=1}^{p}\sum_{l=1}^{p} x_{ik}x_{il}h^{kl} \right)$$

and $h^{kl}$ is the (k,l)th element of $H^{-1}(\hat\beta)$. The adjusted residuals are asymptotically standard normal.


Deviance Residuals 


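Pierce and Schafer (1986) and McCullagh and Nelder (1989) define the signed square root of the individual contribution to the $G^2$ statistic as the deviance residual. This statistic for the ith cell is

$$r_i^{D} = \operatorname{sign}(n_i - \hat m_i)\sqrt{d_i}$$

where

$$d_i = \begin{cases} 2\left( n_i\log(n_i/\hat m_i) - (n_i - \hat m_i) \right) & \text{if } z_i > 0,\ \hat m_i > 0, \text{ and } n_i > 0 \\ 2\hat m_i & \text{if } z_i > 0,\ \hat m_i > 0, \text{ and } n_i = 0 \\ 0 & \text{if } z_i > 0 \text{ and } n_i = \hat m_i \\ \text{SYSMIS} & \text{otherwise} \end{cases}$$

When all $z_i > 0$, $\sum_{i=1}^{r} (r_i^{D})^2 = G^2$.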
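
The two identities for the Poisson case, that squared standardized residuals sum to $X^2$ and squared deviance residuals sum to $G^2$ when all $z_i > 0$, can be verified with a short sketch. The counts are illustrative, and every cell is assumed to have $n_i > 0$ and $\hat m_i > 0$, so the special cases are not exercised.

```python
import math

n = [30, 10, 20, 20]              # observed counts
m_hat = [25.0, 15.0, 25.0, 15.0]  # fitted counts (independence model)

# Standardized (Pearson) residuals.
pearson = [(ni - mi) / math.sqrt(mi) for ni, mi in zip(n, m_hat)]

# Deviance residuals: signed square roots of the deviance contributions.
d = [2.0 * (ni * math.log(ni / mi) - (ni - mi)) for ni, mi in zip(n, m_hat)]
deviance = [math.copysign(math.sqrt(max(di, 0.0)), ni - mi)
            for di, ni, mi in zip(d, n, m_hat)]

x2 = sum((ni - mi) ** 2 / mi for ni, mi in zip(n, m_hat))
g2 = 2.0 * sum(ni * math.log(ni / mi) - (ni - mi) for ni, mi in zip(n, m_hat))

print(sum(r * r for r in pearson), x2)
print(sum(r * r for r in deviance), g2)
```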


Generalized Residual 
Consider a linear combination of the cell counts $\sum_{i=1}^{r} d_i n_i$, where $d_i$ are real numbers.

The estimated expected value is

$$\sum_{i=1}^{r} d_i\hat m_i$$

The simple residual for this linear combination is

$$\sum_{i=1}^{r} d_i(n_i - \hat m_i)$$

The standardized residual for this linear combination is

$$\frac{\sum_{i=1}^{r} d_i(n_i - \hat m_i)}{\sqrt{\sum_{i=1}^{r} d_i^2\hat m_i}}$$

Using the results in Christensen (1990, p. 227), the adjusted residual for this linear combination is

$$\frac{\sum_{i=1}^{r} d_i(n_i - \hat m_i)}{\sqrt{V}}$$

where

$$V = \sum_{i=1}^{r}\sum_{j=1}^{r} d_i d_j\hat m_j(\delta_{ij} - a_{ij}) = \sum_{i=1}^{r} d_i^2\hat m_i - \sum_{i=1}^{r}\sum_{j=1}^{r} d_i d_j a_{ij}\hat m_j$$

where

$$a_{ij} = \hat m_i\left( h^{00} + \sum_{k=1}^{p} (x_{ik} + x_{jk})h^{0k} + \sum_{k=1}^{p}\sum_{l=1}^{p} x_{ik}x_{jl}h^{kl} \right)$$


Generalized Log-Odds Ratio 


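Consider a linear combination of the natural logarithm of cell counts

$$\sum_{i=1}^{r} d_i\log(m_i)$$

where $d_i$ are real numbers with the restriction

$$\sum_{i=1}^{r} d_i = 0$$

The linear combination is estimated by

$$\sum_{i=1}^{r} d_i\log(\hat m_i) = \sum_{i=1}^{r} d_i\log(z_i) + \sum_{i=1}^{r}\sum_{k=1}^{p} d_i x_{ik}\hat\beta_k$$

The variance is

$$\operatorname{var}\left( \sum_{i=1}^{r} d_i\log(\hat m_i) \right) = \sum_{k=1}^{p}\sum_{l=1}^{p} w_k w_l h^{kl}$$

where

$$w_k = \sum_{i=1}^{r} d_i x_{ik}, \quad k = 1, \ldots, p$$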
Wald Statistic 
The null hypothesis is

$$H_0: \sum_{i=1}^{r} d_i\log(m_i) = 0$$

The Wald statistic is

$$W = \frac{\left( \sum_{i=1}^{r} d_i\log(\hat m_i) \right)^2}{\sum_{k=1}^{p}\sum_{l=1}^{p} w_k w_l h^{kl}}$$

Under the null hypothesis, the statistic has an asymptotic chi-square distribution with 1 degree of freedom. The significance level is $\operatorname{Prob}(\chi_1^2 > W)$. Note: W will be system missing if the variance is 0.
Asymptotic Confidence Interval 
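The asymptotic $(1 - \alpha) \times 100\%$ confidence interval is

$$\sum_{i=1}^{r} d_i\log(\hat m_i) \pm z_{\alpha/2}\sqrt{\sum_{k=1}^{p}\sum_{l=1}^{p} w_k w_l h^{kl}}$$

where $z_{\alpha/2}$ is the upper $\alpha/2$ point of the standard normal distribution. The default value of $\alpha$ is 0.05.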


Aggregated Data (Poisson) 


This section shows how data are aggregated for a Poisson distribution. The following notation is 
used in this section: 


Vi Number of cases for B = i(i=1,...,1r) 

Nis sth caseweight for B = i(s = 1,...,v;) 

Lis Covariate 

Zis Cell weight 

Cis GRESID coefficient 

Cis GLOR coefficient 

vu? Number of positive z;, (cell weights) for 1 < s < v; 


ye F epg 
n; = ie if v; >0 
te : 
0 if v; = 0 or v; =0 


n= Nis if nis > 0 and zj5 > 0 
"is 10 ifnis < Oand zis > 0 


and ©};~,~,, Means summation over the range of s with the terms z;, > 0. 
The cell weight value is 


ae af Me ifn; > 0 and vu; >0 
y ifn; =Oandv; >0 
(es if vi =0 
1 if vu; =0 
If no variable is specified as the cell weight variable, then all cases have unit cell weights by
default. 


The cell covariate value is 


ni ao s/n; ifn; > Oandv; > 0 
Lig = vi<s<v,Lis/Y; ifn; =Oandu, >0 
0 if vt =Oorv; =0 


The cell GRESID coefficient is 


Vicsen,micia/n: ifn; > and»; >0 
a vik ui; Cie / U; if n; = 0 and uf >0 
0 if vf =O ory; =0 


There are no defaults for the GRESID coefficients. 


The cell GLOR coefficient is 


Li <s<u,;Nislis/ Mi ifn; > Oandv; > 0 
= ¢ Vicsey,Cis/U; ifn; =Oandv; >0 
0 ifv,, =Oorv; =0 


There are no defaults for the GLOR coefficients. 
GLM Algorithms


GLM (general linear model) is a general procedure for analysis of variance and covariance, as well as regression. It can be used for univariate, multivariate, and repeated measures designs. Algorithms that apply only to repeated measures are in “Repeated Measures”.

For information on post hoc tests, see Post Hoc Tests. For sums of squares, see Sums of Squares. For distribution functions, see Distribution and Special Functions. For Box’s M test, see Box’s M Test.


Notation 


The following notation is used throughout this chapter. Unless otherwise stated, all vectors are 
column vectors and all quantities are known. 


$n$          Number of cases.

$N$          Effective sample size.

$p$          Number of parameters (including the constant, if it exists) in the model.

$r$          Number of dependent variables in the model.

$Y$          $n \times r$ matrix of dependent variables. The rows are the cases and the columns are the dependent variables. The ith row is $y_i'$, $i = 1, \ldots, n$.

$X$          $n \times p$ design matrix. The rows are the cases and the columns are the parameters. The ith row is $x_i'$, $i = 1, \ldots, n$.

$r_X$        Number of nonredundant columns in the design matrix. Also the rank of the design matrix.

$w_i$        Regression weight of the ith case.

$f_i$        Frequency weight of the ith case.

$B$          $p \times r$ unknown parameter matrix. The columns are the dependent variables. The jth column is $b_j$, $j = 1, \ldots, r$.

$\Sigma$     $r \times r$ unknown common multiplier of the covariance matrix of any row of Y. The (i, j)th element is $\sigma_{ij}$, $i = 1, \ldots, r$, $j = 1, \ldots, r$.


Model 


The model is $Y = XB$, and $y_i'$ is independently distributed as an r-dimensional normal distribution with mean $x_i'B$ and covariance matrix $w_i^{-1}\Sigma$. The ith case is ignored if $w_i \le 0$.


Frequency Weight and Total Sample Size 


The frequency weight $f_i$ is the number of replications represented by a case in IBM® SPSS® Statistics; therefore, the weight must be a non-negative integer. It is computed by rounding the value in the weight variable to the nearest integer. The total sample size is $N = \sum_{i=1}^{n} f_i I(w_i > 0)$, where $I(w_i > 0) = 1$ if $w_i > 0$ and is equal to 0 otherwise.
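
A minimal sketch of the total sample size computation follows. The function name `total_sample_size` is hypothetical, and ties are rounded half up, which is an assumption because the text only says "nearest integer" (Python's built-in `round` would round ties to the even integer instead).

```python
import math


def total_sample_size(weight_values, regression_weights):
    """N = sum of f_i over cases with a positive regression weight."""
    total = 0
    for v, w in zip(weight_values, regression_weights):
        f = math.floor(v + 0.5)  # frequency weight rounded to the nearest integer
        if f > 0 and w > 0:      # cases with w_i <= 0 are ignored
            total += f
    return total


# Four cases: weight-variable values and regression weights.
N = total_sample_size([1.4, 2.6, 1.0, 3.0], [1.0, 0.0, 2.0, -1.0])
print(N)
```

Only the first and third cases contribute here, because the other two have non-positive regression weights.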
The Cross-Product and Sums-of-Squares Matrices

To prepare for the SWEEP operation, an augmented row vector of length $(p + r)$ is formed:

$$z_i' = (x_i', y_i')$$

Then the $(p + r) \times (p + r)$ matrix is computed:

$$Z'WZ = \sum_{i=1}^{n} f_i w_i z_i z_i'$$

This matrix is partitioned as

$$Z'WZ = \begin{pmatrix} X'WX & X'WY \\ Y'WX & Y'WY \end{pmatrix}$$

The upper left $p \times p$ submatrix is $X'WX$ and the lower right $r \times r$ submatrix is $Y'WY$.


Sweep Operation 


Three important matrices, G, B, and S, are obtained by sweeping the Z’WZ matrix as follows: 


1. Sweep sequentially the first p rows and the first p columns of Z’ WZ, starting from the first 
row and the first column. 


2. After the pth row and the pth column are swept, the resulting matrix is

$$\begin{pmatrix} -G & \hat B \\ \hat B' & S \end{pmatrix}$$

where G is a $p \times p$ symmetric $g_2$ generalized inverse of $X'WX$, $\hat B$ is the $p \times r$ matrix of parameter estimates, and S is the $r \times r$ symmetric matrix of sums of squares and cross products of residuals.


The SWEEP routine is adapted from Algorithm AS 178 by Clarke (1982) and Remarks R78 by 
Ridout and Cobby (1989). 
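
To make the sweep operation concrete, here is a small standard-library sketch of a symmetric sweep applied to $Z'WZ$. It is not Algorithm AS 178 itself: the pivot tolerance and the rule of zeroing a row and column when the pivot is (numerically) zero are assumptions standing in for the handling of redundant columns, and the data are a toy example.

```python
def sweep(a, k, tol=1e-12):
    """Sweep the square matrix a in place on pivot k."""
    n = len(a)
    d = a[k][k]
    if abs(d) < tol:
        # Redundant column (assumed handling): zero out the row and column.
        for i in range(n):
            a[i][k] = 0.0
            a[k][i] = 0.0
        return
    for i in range(n):
        for j in range(n):
            if i != k and j != k:
                a[i][j] -= a[i][k] * a[k][j] / d
    for i in range(n):
        if i != k:
            a[i][k] /= d
            a[k][i] /= d
    a[k][k] = -1.0 / d


# Toy data lying exactly on the line y = 1 + 2 * x.
x = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]  # design rows (constant, covariate)
y = [[1.0], [3.0], [5.0]]                 # one dependent variable
f = [1, 1, 1]                             # frequency weights
w = [1.0, 1.0, 1.0]                       # regression weights
p, r = 2, 1

size = p + r
zwz = [[0.0] * size for _ in range(size)]
for xi, yi, fi, wi in zip(x, y, f, w):
    if wi <= 0:
        continue  # cases with non-positive regression weight are ignored
    zi = xi + yi  # augmented row (x_i', y_i')
    for u in range(size):
        for v in range(size):
            zwz[u][v] += fi * wi * zi[u] * zi[v]

for k in range(p):  # sweep the first p rows and columns in order
    sweep(zwz, k)

G = [[-zwz[i][j] for j in range(p)] for i in range(p)]
B = [row[p:] for row in zwz[:p]]
S = [row[p:] for row in zwz[p:]]
print(G, B, S)
```

For these data the swept matrix yields the intercept 1 and slope 2 in B and a residual sum of squares of 0 in S, since the points lie exactly on a line.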


Residual Covariance Matrix 


The estimated r×r covariance matrix is $\hat{\Sigma} = S/(N - r_x)$ provided $r_x < N$. If $r_x = N$, then
$\hat{\Sigma} = 0$. If $r_x > N$, then all elements of $\hat{\Sigma}$ are system missing. The residual degrees of freedom is
$N - r_x$. If $r_x > N$, then the degrees of freedom is system missing.


Lack of Fit 

Source of Variation    Sum of Squares                                                        df
Lack of fit            $\sum_{i=1}^{n_u} n_i (\bar{y}_i - \hat{y}_i)^2$                      $n_u - p$
Pure error             $\sum_{i=1}^{n_u} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2$            $N - n_u$

where $n_u$ is the number of unique combinations of observed predictor values and $n_i$ is the number
of cases with the ith combination.


Mean squares are calculated by dividing each sum of squares by its degrees of freedom. 


The F ratio for testing lack of fit is the ratio of the lack-of-fit mean square to the pure-error
mean square.


The significance level is obtained from the F distribution with $n_u - p$ and $N - n_u$ degrees of
freedom.


Parameter Estimates 


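Let the elements of $\hat{\Sigma}$ be $\hat{\sigma}_{ij}$, the elements of G be $g_{ij}$, and the elements of $\hat{B}$ be $\hat{b}_{ij}$. Then $\mathrm{var}(\hat{b}_{ij})$ is
estimated by $\hat{\sigma}_{jj} g_{ii}$ for i=1,...,p, j=1,...,r and $\mathrm{cov}(\hat{b}_{ij}, \hat{b}_{rs})$ is estimated by $\hat{\sigma}_{js} g_{ir}$ for i, r=1,...,p, j,
s=1,...,r.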


Standard Error of the Estimate 


$$se(\hat{b}_{ij}) = \sqrt{\hat{\sigma}_{jj}\, g_{ii}}$$


When the ith parameter is redundant, the standard error is system missing. 


The t Statistic 
For testing $H_0: b_{ij} = 0$ versus $H_1: b_{ij} \ne 0$, the t statistic is

$$t = \begin{cases} \hat{b}_{ij}/se(\hat{b}_{ij}) & \text{if the standard error is positive} \\ \text{SYSMIS} & \text{otherwise} \end{cases}$$

The significance value for this statistic is $2(1 - \text{CDF.T}(|t|, N - r_x))$ where CDF.T is the IBM®
SPSS® Statistics function for the cumulative t distribution.


Partial Eta Squared Statistic 
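
$$\eta^2 = \begin{cases} \hat{b}_{ij}^2 / \left(\hat{b}_{ij}^2 + (N - r_x)\,\mathrm{var}(\hat{b}_{ij})\right) & \text{if } r_x < N \text{ and the denominator is positive} \\ 1 & \text{if } r_x = N \text{ but } \hat{b}_{ij} \ne 0 \\ \text{SYSMIS} & \text{otherwise} \end{cases}$$

The value should be within $0 \le \eta^2 \le 1$.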
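
The standard error, t statistic, and partial eta squared defined above can be sketched for a single parameter as follows. This is an illustrative sketch only: the function name `parameter_stats` and its arguments are hypothetical, and `None` is used as a stand-in for the system-missing value.

```python
import math


def parameter_stats(b, sigma_jj, g_ii, n_total, r_x):
    """Return (se, t, partial_eta_sq) for one estimate; None stands in for SYSMIS."""
    df = n_total - r_x
    var_b = sigma_jj * g_ii                     # estimated var(b_ij)
    se = math.sqrt(var_b) if var_b > 0 else None
    t = b / se if se else None                  # t needs a positive standard error
    if df > 0:
        denom = b * b + df * var_b
        eta_sq = b * b / denom if denom > 0 else None
    elif df == 0 and b != 0:
        eta_sq = 1.0
    else:
        eta_sq = None
    return se, t, eta_sq


se, t, eta_sq = parameter_stats(b=2.0, sigma_jj=4.0, g_ii=0.25, n_total=12, r_x=2)
```

When the standard error is positive, the same partial eta squared can be written as $t^2/(t^2 + N - r_x)$, which is a quick way to check the two quantities against each other.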


Noncentrality Parameter 


$$c = |t|$$


Observed Power


$$p = \begin{cases} 1 - \text{NCDF.T}(t_c, N - r_x, c) + \text{NCDF.T}(-t_c, N - r_x, c) & \text{if } r_x < N \\ \text{SYSMIS} & \text{if } r_x \ge N \text{ or any arguments to NCDF.T or IDF.T are SYSMIS} \end{cases}$$

where $t_c = \text{IDF.T}(1 - \alpha/2, N - r_x)$ and $\alpha$ is the user-specified chance of Type I error
$(0 < \alpha < 1)$. NCDF.T and IDF.T are the IBM® SPSS® Statistics functions for the cumulative
noncentral t distribution and for the inverse cumulative t distribution, respectively.


The default value is $\alpha = 0.05$. The observed power should be within $0 \le p \le 1$.


Confidence Interval 

For the p% level, the individual univariate confidence interval for the parameter is

$$\hat{b}_{ij} \pm t_\alpha\, se(\hat{b}_{ij})$$

where $t_\alpha = \text{IDF.T}(0.5(1 + p/100), N - r_x)$, for each parameter $\hat{b}_{ij}$ of the p×r matrix $\hat{B}$. The
default confidence level p is 95 (0 < p < 100).


Correlation 


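$$\mathrm{corr}(\hat{b}_{ij}, \hat{b}_{rs}) = \begin{cases} \hat{\sigma}_{js}\, g_{ir} / \left(se(\hat{b}_{ij})\, se(\hat{b}_{rs})\right) & \text{if the standard errors are positive} \\ \text{SYSMIS} & \text{otherwise} \end{cases}$$

for i, r=1,...,p, j, s=1,...,r.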


Estimated Marginal Means 


Estimated marginal means (EMMEANS) are computed as the generic $l'\hat{B}m$ expression with
appropriate l and m vectors. l is a column vector of length p and m is a column vector of length r.
Since the l vector is chosen to be always estimable, the quantity $l'\hat{B}m$ is in fact the estimated
modified marginal means (Searle, Speed, and Milliken, 1980). When covariates (or products of
covariates) are present in the effects, the overall means of the covariates (or products of covariates)
are used in the L matrix. Suppose X and Y are covariates and they appear as X*Y in an effect; then
the mean of X*Y is used instead of the product of the mean of X and the mean of Y.


L Matrix 


For each level combination of the between-subjects factors in TABLES, identify the nonmissing
cases with positive case weights and positive regression weights which are associated with the
current level combination. Suppose the cases are classified by three between-subjects factors:
A, B and C. Now A and B are specified in TABLES and the current level combination is A=1
and B=2. A case in the cell A=1, B=2, and C=3 is associated with the current level combination,
whereas a case in the cell A=1, B=3 and C=3 is not. Compute the average of the design matrix
rows corresponding to these cases.

If an effect contains a covariate, then its parameters which belong to the current level
combination are equal to the mean of the covariate, and are equal to 0 otherwise. Using the above
example, for effect A*X where X is a covariate, the parameter [A=1]*X belongs to the current
level combination whereas the parameter [A=2]*X does not. If the effect contains a product of
covariates, then the mean of the product is applied.

The result is the l vector for the current between-subjects factor level combination. When none
of the between-subjects effects contain covariates, the vector always forms an estimable function.
Otherwise, a non-estimable function may occur, depending on the data.


M Matrix 


The M matrix is formed as a series of Kronecker products

$$M = I_c \otimes A_1 \otimes \cdots \otimes A_t$$

where

$$A_k = \begin{cases} I_{r_k} & \text{if the kth within-subjects factor is specified in TABLES} \\ (1/r_k)\,1_{r_k} & \text{otherwise} \end{cases}$$

with $1_{r_k}$ a column vector of length $r_k$ and all of its elements equal to 1.


If OVERALL or only between-subjects factors are specified in TABLES, then $A_k = (1/r_k)\,1_{r_k}$ for
k=1,...,t.


The column for a particular within-subjects factor level combination, denoted by m, is extracted 
accordingly from this M matrix. 


Standard Error


$$se(l'\hat{B}m) = \begin{cases} \sqrt{(l'Gl)(m'\hat{\Sigma}m)} & \text{if } N - r_x > 0 \\ \text{SYSMIS} & \text{otherwise} \end{cases} \qquad (1)$$

Since l are coefficients of an estimable function, the standard error is the same for any generalized
inverse G.


Significance


The t statistic is

$$t = \begin{cases} l'\hat{B}m / se(l'\hat{B}m) & \text{if } se(l'\hat{B}m) > 0 \\ \text{SYSMIS} & \text{otherwise} \end{cases}$$

If the t statistic is not system missing, then the significance is computed based on a t distribution
with $N - r_x$ degrees of freedom.


Pairwise Comparison


The levels of a between-subjects or within-subjects factor can be compared pair-by-pair. For 
example, a factor with 3 levels produces 3 pairwise comparisons: 1 vs. 2, 1 vs. 3, and 2 vs. 3. 


Between-Subjects Factor 


Suppose the l vectors are indexed by the level of the between-subjects factor as $l_{i_1,\ldots,i_b}$,
$i_s = 1,\ldots,n_s$ and $s = 1,\ldots,b$, where $n_s$ is the number of levels of between-subjects factor s and
b is the number of between-subjects factors specified inside TABLES. The difference in estimated
marginal means of level $i_s$ and level $i_s'$ of between-subjects factor s at fixed levels of other
between-subjects factors is

$$\left(l_{i_1,\ldots,i_s,\ldots,i_b} - l_{i_1,\ldots,i_s',\ldots,i_b}\right)'\hat{B}m \quad \text{for } i_s, i_s' = 1,\ldots,n_s;\ i_s \ne i_s'$$

The standard error of the difference is computed by substituting
$\left(l_{i_1,\ldots,i_s,\ldots,i_b} - l_{i_1,\ldots,i_s',\ldots,i_b}\right)$ for l in the standard error formula (1).


Within-Subjects Factor 


Suppose the m vectors are indexed by level of the within-subjects factor as $m_{j_1,\ldots,j_w}$,
$j_s = 1,\ldots,n_s$ and $s = 1,\ldots,w$, where $n_s$ is the number of levels of within-subjects factor s
and w is the number of within-subjects factors specified inside TABLES. The difference in
estimated marginal means of level $j_s$ and level $j_s'$ of within-subjects factor s at fixed levels
of other within-subjects factors is

$$l'\hat{B}\left(m_{j_1,\ldots,j_s,\ldots,j_w} - m_{j_1,\ldots,j_s',\ldots,j_w}\right) \quad \text{for } j_s, j_s' = 1,\ldots,n_s;\ j_s \ne j_s'$$

The standard error of the difference is computed by substituting
$\left(m_{j_1,\ldots,j_s,\ldots,j_w} - m_{j_1,\ldots,j_s',\ldots,j_w}\right)$ for m in the standard error formula (1).


Confidence Interval 

The $(1 - \alpha) \times 100\%$ confidence interval is:

$$l'\hat{B}m \pm t_{1-\alpha/2;\,N-r_x} \times se(l'\hat{B}m)$$

where $t_{1-\alpha/2;\,N-r_x}$ is the $(1 - \alpha/2) \times 100\%$ percentile of a t distribution with $N - r_x$ degrees of
freedom. No confidence interval is computed if $N - r_x \le 0$.


Saved Values 


Temporary variables can be added to the working data file. These include predicted values,
residuals, and diagnostics.


Predicted Values


The n×r matrix of predicted values is $\hat{Y} = X\hat{B}$. The ith row of $\hat{Y}$ is $\hat{y}_i' = x_i'\hat{B}$, i=1,...,n. Let the
elements of $\hat{Y}$ be $\hat{y}_{ij}$ and the elements of $XGX'$ be $h_{ij}$.


The standard error of $\hat{y}_{ij}$ is

$$se(\hat{y}_{ij}) = \sqrt{\hat{\sigma}_{jj}\, h_{ii}} \quad \text{for } i=1,\ldots,n,\ j=1,\ldots,r$$

The weighted predicted value of the ith case is $\sqrt{w_i}\,\hat{y}_i$.


Residuals 
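
The n×r matrix of residuals is $\hat{E} = Y - \hat{Y}$.
The ith row of $\hat{E}$ is $\hat{e}_i' = y_i' - \hat{y}_i'$, i=1,...,n.
Let the elements of $\hat{E}$ be $\hat{e}_{ij}$; then

$$\hat{e}_{ij} = y_{ij} - \hat{y}_{ij} \quad \text{for } i=1,\ldots,n,\ j=1,\ldots,r$$

The weighted residual is $\sqrt{w_i}\,\hat{e}_{ij}$.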


Deleted Residuals (PRESS Residuals) 


The deleted residual is the predicted residual for the ith case that results from omitting the ith 
case from estimation. It is: 


$$\mathrm{DRESID}_{ij} = \begin{cases} \hat{e}_{ij} / (1/w_i - h_{ii}) & \text{if } w_i > 0 \text{ and } w_i h_{ii} < 1 \\ \text{SYSMIS} & \text{otherwise} \end{cases}$$


for i=1,...,n, j=1,...,r. 


Standardized Residuals 


The standardized residual is the residual divided by the standard deviation of data: 


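$$\mathrm{ZRESID}_{ij} = \begin{cases} (y_{ij} - \hat{y}_{ij}) / \sqrt{\hat{\sigma}_{jj}/w_i} & \text{if } w_i > 0 \\ \text{SYSMIS} & \text{otherwise} \end{cases}$$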


Studentized Residuals 


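The standard error for $\hat{e}_{ij}$ is

$$se(\hat{e}_{ij}) = \begin{cases} \sqrt{\hat{\sigma}_{jj}\,(1/w_i - h_{ii})} & \text{if } w_i > 0 \text{ and } w_i h_{ii} < 1 \\ \text{SYSMIS} & \text{otherwise} \end{cases}$$

for i=1,...,n, j=1,...,r. The Studentized residual is the residual divided by the standard error
of the residual.

$$\mathrm{SRESID}_{ij} = \begin{cases} \hat{e}_{ij} / se(\hat{e}_{ij}) & \text{if } w_i > 0 \text{ and } se(\hat{e}_{ij}) > 0 \\ \text{SYSMIS} & \text{otherwise} \end{cases}$$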


Diagnostics 


The following diagnostic statistics are available. 


Cook’s Distance 


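Cook’s Distance D measures the change to the solution that results from omitting each observation.
The formula is

$$D_{ij} = \left(\frac{\hat{e}_{ij}}{\sqrt{\hat{\sigma}_{jj}\,(1/w_i - h_{ii})}}\right)^2 \left(\frac{h_{ii}}{1/w_i - h_{ii}}\right) \frac{1}{r_x}$$

for i=1,...,n, j=1,...,r. This formula is equivalent to

$$D_{ij} = \left(\hat{e}_{ij}/se(\hat{e}_{ij})\right)^2 \left(se(\hat{y}_{ij})/se(\hat{e}_{ij})\right)^2 / r_x \quad \text{provided } w_i > 0 \text{ and } se(\hat{e}_{ij}) > 0$$

When $w_i \le 0$ or $se(\hat{e}_{ij}) = 0$, $D_{ij}$ is system missing.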
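
As a rough illustration of how the standard errors, the Studentized residual, and Cook's distance fit together for one case and one dependent variable, consider the sketch below. The function name `case_diagnostics` and its arguments are hypothetical; `h` denotes the ith diagonal element of $XGX'$, and `None` stands in for the system-missing value.

```python
import math


def case_diagnostics(e, h, w, sigma_jj, r_x):
    """Return (se_pred, se_resid, studentized, cook); None stands in for SYSMIS."""
    if w <= 0:
        return None, None, None, None
    se_pred = math.sqrt(sigma_jj * h)               # se of the predicted value
    if w * h >= 1:
        return se_pred, None, None, None
    se_resid = math.sqrt(sigma_jj * (1.0 / w - h))  # se of the residual
    if se_resid <= 0:
        return se_pred, se_resid, None, None
    studentized = e / se_resid
    # Second (equivalent) form of Cook's distance.
    cook = studentized ** 2 * (se_pred / se_resid) ** 2 / r_x
    return se_pred, se_resid, studentized, cook


se_pred, se_resid, sresid, cook = case_diagnostics(e=1.0, h=0.2, w=1.0, sigma_jj=1.0, r_x=2)
```

With unit weight the ratio of squared standard errors reduces to $h_{ii}/(1 - h_{ii})$, the familiar leverage factor in Cook's distance.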


Leverage 


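The leverage for the ith case (i=1,...,n) for all dependent variables is

$$\mathrm{LEVER}_i = \begin{cases} h_{ii} & \text{if } w_i > 0 \\ \text{SYSMIS} & \text{otherwise} \end{cases}$$

where $h_{ii}$ is the ith diagonal element of $XGX'$.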


Hypothesis Testing 


Let L be an $l \times p$ known matrix, M be an $r \times m$ known matrix and K be an $l \times m$ known matrix.
The test hypotheses $H_0: LBM = K$ versus $H_1: LBM \ne K$ are testable if and only if LB is
estimable.


The following results apply to testable hypotheses only. Nontestable hypotheses are excluded.


The hypothesis SSCP matrix is $S_H = (L\hat{B}M - K)'(LGL')^{-1}(L\hat{B}M - K)$ and the error
SSCP matrix is $S_E = M'SM$.


Four test statistics, based on the eigenvalues of $S_E^{-1}S_H$, are available: Wilks’ lambda,
Hotelling-Lawley trace, Pillai’s trace, and Roy’s largest root.


Let the eigenvalues of $S_E^{-1}S_H$ be $\lambda_1 \ge \cdots \ge \lambda_{r_E} \ge 0$ and $\lambda_{r_E+1} = \cdots = \lambda_m = 0$, and let
$r_E = \mathrm{rank}(S_E)$; $s = \min(l, r_E)$; $n_e = n - r_x$; $m^* = \frac{1}{2}(|r_E - l| - 1)$; $n^* = \frac{1}{2}(n_e - r_E - 1)$.


Wilks’ Lambda


$$\Lambda = \frac{\det(S_E)}{\det(S_H + S_E)} = \prod_{k=1}^{s} \frac{1}{1 + \lambda_k}$$

When $H_0$ is true, the F statistic

$$F = \frac{(\tau t - 2\upsilon)\,(1 - \Lambda^{1/t})}{l\, r_E\, \Lambda^{1/t}}$$

follows asymptotically an F distribution, where

$$\tau = n_e - \tfrac{1}{2}(r_E - l + 1)$$

$$\upsilon = \tfrac{1}{4}(l\, r_E - 2)$$

$$t = \begin{cases} \sqrt{(l^2 r_E^2 - 4)/(l^2 + r_E^2 - 5)} & \text{if } (l^2 + r_E^2 - 5) > 0 \\ 1 & \text{otherwise} \end{cases}$$

The degrees of freedom are $(l\, r_E,\ \tau t - 2\upsilon)$. The F statistic is exact if s=1,2. See Rao (1951) and
Section 8c.5 of Rao (1973) for details.


The eta-squared statistic is $\eta^2 = 1 - \Lambda^{1/s}$.
The noncentrality parameter is $\lambda = (\tau t - 2\upsilon)\,\eta^2/(1 - \eta^2)$.
The power is $1 - \text{NCDF.F}(F_\alpha,\ l\, r_E,\ \tau t - 2\upsilon,\ \lambda)$ where $F_\alpha$ is the upper $100\alpha$ percentage point
of the central F distribution, and $\alpha$ is user-specified on the ALPHA keyword on the CRITERIA
subcommand.


Hotelling-Lawley Trace 

In IBM® SPSS® Statistics, the name Hotelling-Lawley trace is shortened to Hotelling’s trace.

$$T = \mathrm{trace}(S_E^{-1}S_H) = \sum_{k=1}^{s} \lambda_k$$

When $H_0$ is true, the F statistic

$$F = \frac{2(s n^* + 1)\, T}{s^2 (2m^* + s + 1)}$$

follows asymptotically an F distribution with degrees of freedom $(s(2m^* + s + 1),\ 2(s n^* + 1))$.
The F statistic is exact if s=1.


The eta-squared statistic is $\eta^2 = (T/s)/(T/s + 1)$.
The noncentrality parameter is $\lambda = 2(s n^* + 1)\,\eta^2/(1 - \eta^2)$.
The power is $1 - \text{NCDF.F}(F_\alpha,\ s(2m^* + s + 1),\ 2(s n^* + 1),\ \lambda)$ where $F_\alpha$ is the upper $100\alpha$
percentage point of the central F distribution, and $\alpha$ is user-specified on the ALPHA keyword on
the CRITERIA subcommand.


Pillai’s Trace


$$V = \mathrm{trace}\left(S_H (S_H + S_E)^{-1}\right) = \sum_{k=1}^{s} \frac{\lambda_k}{1 + \lambda_k}$$

When $H_0$ is true, the F statistic

$$F = \frac{(2n^* + s + 1)\, V}{(2m^* + s + 1)(s - V)}$$

follows asymptotically an F distribution with degrees of freedom $(s(2m^* + s + 1),\ s(2n^* + s + 1))$.
The F statistic is exact if s=1.


The eta-squared statistic is $\eta^2 = V/s$.
The noncentrality parameter is $\lambda = s(2n^* + s + 1)\,\eta^2/(1 - \eta^2)$.
The power is $1 - \text{NCDF.F}(F_\alpha,\ s(2m^* + s + 1),\ s(2n^* + s + 1),\ \lambda)$ where $F_\alpha$ is the upper $100\alpha$
percentage point of the central F distribution, and $\alpha$ is user-specified on the ALPHA keyword on
the CRITERIA subcommand.

Roy’s Largest Root 

$$\Theta = \lambda_1$$

which is the largest eigenvalue of $S_E^{-1}S_H$. When $H_0$ is true, the F statistic is

$$F = \Theta\,(n_e - \omega + l)/\omega$$

where $\omega = \max(l, r_E)$. This statistic is an upper bound of F that yields a lower bound on the significance level.
The degrees of freedom are $(\omega,\ n_e - \omega + l)$. The F statistic is exact if s=1.


The eta-squared statistic is $\eta^2 = \Theta/(1 + \Theta)$.
The noncentrality parameter is $\lambda = (n_e - \omega + l)\,\eta^2/(1 - \eta^2)$.


The power is $1 - \text{NCDF.F}(F_\alpha,\ \omega,\ n_e - \omega + l,\ \lambda)$ where $F_\alpha$ is the upper $100\alpha$ percentage point of
the central F distribution, and $\alpha$ is user-specified on the ALPHA keyword on the CRITERIA
subcommand.
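
The four statistics and their F approximations can be computed directly from the eigenvalues, as in the following sketch. It assumes the eigenvalues of $S_E^{-1}S_H$ are already available (no eigen-solver is included) and that `l` is the row dimension of L; the function name `multivariate_tests` is hypothetical.

```python
import math


def multivariate_tests(eigenvalues, l, r_e, n_e):
    """Return {name: (statistic, F, df1, df2)} from the eigenvalues of S_E^{-1} S_H."""
    s = min(l, r_e)
    lam = sorted(eigenvalues, reverse=True)[:s]
    m_star = 0.5 * (abs(r_e - l) - 1)
    n_star = 0.5 * (n_e - r_e - 1)

    # Wilks' lambda with Rao's F approximation.
    wilks = 1.0
    for x in lam:
        wilks /= 1.0 + x
    tau = n_e - 0.5 * (r_e - l + 1)
    upsilon = 0.25 * (l * r_e - 2)
    d = l * l + r_e * r_e - 5
    t = math.sqrt((l * l * r_e * r_e - 4) / d) if d > 0 else 1.0
    df2_w = tau * t - 2 * upsilon
    f_w = df2_w * (1 - wilks ** (1 / t)) / (l * r_e * wilks ** (1 / t))

    # Hotelling-Lawley trace.
    trace_t = sum(lam)
    df1 = s * (2 * m_star + s + 1)
    df2_t = 2 * (s * n_star + 1)
    f_t = df2_t * trace_t / (s * df1)

    # Pillai's trace.
    v = sum(x / (1 + x) for x in lam)
    df2_v = s * (2 * n_star + s + 1)
    f_v = (2 * n_star + s + 1) * v / ((2 * m_star + s + 1) * (s - v))

    # Roy's largest root (upper bound on F).
    theta = lam[0]
    omega = max(l, r_e)
    f_r = theta * (n_e - omega + l) / omega

    return {
        "wilks": (wilks, f_w, l * r_e, df2_w),
        "hotelling": (trace_t, f_t, df1, df2_t),
        "pillai": (v, f_v, df1, df2_v),
        "roy": (theta, f_r, omega, n_e - omega + l),
    }


# Univariate special case: one eigenvalue, l = r_E = 1, n_e = 10.
results = multivariate_tests([1.0], l=1, r_e=1, n_e=10)
```

In this univariate special case (s = 1) all four F statistics coincide, which is a convenient sanity check on the formulas.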


Individual Univariate Test 

$$F = \frac{S_{H;i}/l}{S_{E;i}/(n - r_x)}, \quad i = 1,\ldots,m$$

where $S_{H;i}$ and $S_{E;i}$ are the ith diagonal elements of the matrices $S_H$ and $S_E$ respectively. Under
the null hypothesis, the F statistic has an F distribution with degrees of freedom $(l,\ n - r_x)$.


The eta-squared statistic is $\eta^2 = S_{H;i}/(S_{H;i} + S_{E;i})$.


The noncentrality parameter is $\lambda = (n - r_x)\,S_{H;i}/S_{E;i}$.


The power is $1 - \text{NCDF.F}(F_\alpha,\ l,\ n - r_x,\ \lambda)$ where $F_\alpha$ is the upper $100\alpha$ percentage point of
the central F distribution, and $\alpha$ is user-specified on the ALPHA keyword on the CRITERIA
subcommand.


Bartlett’s Test of Sphericity 


Bartlett’s test of sphericity is printed when the Residual SSCP matrix is requested. 


Hypotheses 


In Bartlett’s test of sphericity the null hypothesis is $H_0: \Sigma = \sigma^2 I_r$ versus the alternative
hypothesis $H_1: \Sigma \ne \sigma^2 I_r$, where $\sigma^2 > 0$ is unspecified and $I_r$ is an r×r identity matrix.


Likelihood Ratio Test Statistic 


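$$\lambda = \begin{cases} \dfrac{|A|^{n/2}}{(\mathrm{trace}(A)/r)^{nr/2}} & \text{if } \mathrm{trace}(A) > 0 \\ \text{SYSMIS} & \text{if } \mathrm{trace}(A) \le 0 \end{cases}$$

where $A = (Y - X\hat{B})'W(Y - X\hat{B})$ is the r×r matrix of residual sums of squares and cross
products.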


Chi-Square Approximation 


Define $W = \lambda^{2/n}$. When n is large and under the null hypothesis, for $n - r_x \ge 1$ and $r \ge 2$,

$$\Pr(-\rho(n - r_x)\log W \le c) = \Pr(\chi_f^2 \le c) + \omega_2\left(\Pr(\chi_{f+4}^2 \le c) - \Pr(\chi_f^2 \le c)\right) + O(n^{-3})$$

where

$$f = r(r + 1)/2 - 1$$

$$\rho = 1 - (2r^2 + r + 2)/(6r(n - r_x))$$

$$\omega_2 = \frac{(r + 2)(r - 1)(r - 2)(2r^3 + 6r^2 + 3r + 2)}{288\, r^2 (n - r_x)^2 \rho^2}$$


Chi-Square Statistic


$$c = \begin{cases} -\rho(n - r_x)\log W & \text{if } W > 0 \\ \text{SYSMIS} & \text{otherwise} \end{cases}$$


Degrees of Freedom


$$f = r(r + 1)/2 - 1$$


Significance


$$1 - \text{CDF.CHISQ}(c, f) - \omega_2\left(\text{CDF.CHISQ}(c, f + 4) - \text{CDF.CHISQ}(c, f)\right)$$

where CDF.CHISQ is the IBM® SPSS® Statistics function for the cumulative chi-square
distribution. The significance is reset to zero whenever the computed value is less than zero
due to floating point imprecision.
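
The quantities entering Bartlett's test can be sketched as follows, stopping short of the significance value because the Python standard library has no chi-square CDF. The helper names `determinant` and `bartlett_sphericity` are hypothetical, `None` stands in for the system-missing value, and the sketch assumes $n - r_x$ is large enough that $\rho \ne 0$.

```python
import math


def determinant(a):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in a]
    n = len(m)
    det = 1.0
    for c in range(n):
        piv = max(range(c, n), key=lambda i: abs(m[i][c]))
        if m[piv][c] == 0:
            return 0.0
        if piv != c:
            m[c], m[piv] = m[piv], m[c]
            det = -det
        det *= m[c][c]
        for i in range(c + 1, n):
            factor = m[i][c] / m[c][c]
            for j in range(c, n):
                m[i][j] -= factor * m[c][j]
    return det


def bartlett_sphericity(a, n, r_x):
    """Return (W, chi_square, f, omega2) for the r x r residual SSCP matrix a."""
    r = len(a)
    tr = sum(a[i][i] for i in range(r))
    if tr <= 0:
        return None, None, None, None
    w = determinant(a) / (tr / r) ** r          # W = lambda^(2/n)
    f = r * (r + 1) / 2 - 1
    rho = 1 - (2 * r * r + r + 2) / (6 * r * (n - r_x))
    omega2 = ((r + 2) * (r - 1) * (r - 2) * (2 * r ** 3 + 6 * r * r + 3 * r + 2)
              / (288 * r * r * (n - r_x) ** 2 * rho ** 2))
    chi_square = -rho * (n - r_x) * math.log(w) if w > 0 else None
    return w, chi_square, f, omega2


w_stat, chi_square, f, omega2 = bartlett_sphericity([[4.0, 1.0], [1.0, 2.0]], n=12, r_x=2)
```

For r = 2 the correction weight $\omega_2$ vanishes because of the factor (r - 2), so the approximation reduces to a single chi-square term.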


Custom Hypothesis Tests 


The TEST subcommand offers custom hypothesis tests. The hypothesis term is any effect
specified (either explicitly or implicitly) in the DESIGN subcommand. The error term can be a
linear combination of effects that are specified in the DESIGN subcommand or a sum of squares
with specified degrees of freedom. The TEST subcommand is available only for univariate
analysis; therefore, an F statistic is computed. When the error term is a linear combination
of effects and no value for degrees of freedom is specified, the error degrees of freedom is
approximated by the Satterthwaite (1946) method.


Notation 


The following notation is used in this section: 


$S$ Number of effects in the linear combination.

$q_s$ Coefficient of the sth effect in the linear combination, s=1,...,S.

$l_s$ Degrees of freedom of the sth effect in the linear combination, s=1,...,S.

$MS_s$ Mean square of the sth effect in the linear combination, s=1,...,S.

$Q$ Linear combination of effects.

$l_Q$ Degrees of freedom of the linear combination.

$MS_Q$ Mean square of the linear combination.


Error Mean Square 


If the error term is a linear combination of effects, the error mean square is

$$MS_Q = \sum_{s=1}^{S} q_s \times MS_s$$

If the user supplied the mean squares, $MS_Q$ is equal to the number specified after the keyword VS.
If $MS_Q < 0$, the custom error term is invalid, $MS_Q$ is equal to the system-missing value, and
an error message is issued.


Degrees of Freedom 

If $MS_Q \ge 0$ and the user did not supply the error degrees of freedom, then the error degrees of
freedom is approximated using the Satterthwaite (1946) method. Define

$$d_s = \begin{cases} (q_s \times MS_s)^2 / l_s & \text{if } l_s > 0 \\ 0 & \text{otherwise} \end{cases}$$

Then $D = \sum_{s=1}^{S} d_s$. The approximate error degrees of freedom is

$$l_Q = \begin{cases} (MS_Q)^2 / D & \text{if } D > 0 \\ \text{SYSMIS} & \text{otherwise} \end{cases}$$
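
A minimal sketch of the error mean square and its Satterthwaite degrees of freedom follows. The function name `satterthwaite_error_term` is hypothetical, and `None` stands in for the system-missing value.

```python
def satterthwaite_error_term(q, ms, df):
    """Return (MS_Q, l_Q) for the linear combination sum(q_s * MS_s)."""
    ms_q = sum(qs * m for qs, m in zip(q, ms))
    if ms_q < 0:
        return None, None                       # invalid custom error term
    # Effects with zero degrees of freedom contribute d_s = 0.
    d_total = sum((qs * m) ** 2 / l for qs, m, l in zip(q, ms, df) if l > 0)
    l_q = ms_q ** 2 / d_total if d_total > 0 else None
    return ms_q, l_q


ms_q, l_q = satterthwaite_error_term(q=[1.0, 1.0], ms=[2.0, 3.0], df=[4, 6])
```

Here $MS_Q = 5$, $D = 4/4 + 9/6 = 2.5$, and the approximate error degrees of freedom is $25/2.5 = 10$.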


If $MS_Q \ge 0$ and the user supplied the error degrees of freedom, $l_Q$ is equal to the number
following the keyword DF. If $l_Q \le 0$, the custom degrees of freedom is invalid. In this case, $l_Q$ is
equal to the system-missing value and an error message is issued.


F Statistic 

The null hypothesis is that all parameters of the hypothesis effect are zero. The F statistic is used
for testing this null hypothesis. Suppose the mean square and the degrees of freedom of the
hypothesis effect are $MS_H$ and $l_H$; then the F statistic is

$$F = \begin{cases} MS_H / MS_Q & \text{if } MS_H \ge 0 \text{ and } MS_Q > 0 \\ \text{SYSMIS} & \text{otherwise} \end{cases}$$


Significance 


$$\text{significance} = \begin{cases} 1 - \text{CDF.F}(F;\ l_H, l_Q) & \text{if } l_H > 0,\ l_Q > 0 \text{ and } F \ne \text{SYSMIS} \\ \text{SYSMIS} & \text{otherwise} \end{cases}$$

where CDF.F is the IBM® SPSS® Statistics function for the F cumulative distribution function.


Univariate Mixed Model 


This section describes the algorithms pertaining to a random effects model. GLM offers mixed 
model analysis only for univariate models—that is, for r=1. 


Notation 


The following notation is used throughout this section. Unless otherwise stated, all vectors are 
column vectors and all quantities are known. 


$k$ Number of random effects.

$p_0$ Number of parameters in the fixed effects.

$p_i$ Number of parameters in the ith random effect, i=1,...,k.

$\sigma_i^2$ Unknown variance of the ith random effect, $\sigma_i^2 \ge 0$, i=1,...,k.

$\sigma_{k+1}^2$ Unknown variance of the residual term, $\sigma_{k+1}^2 > 0$.

$X_i$ The $n \times p_i$ design matrix, i=0,1,...,k.

$\beta_0$ The length $p_0$ vector of parameters of the fixed effects.

$\beta_i$ The length $p_i$ vector of parameters of the ith random effect, i=1,...,k.

$L$ The $s \times p$ full row rank matrix. The rows are estimable functions, $s \ge 1$.


Relationships between these symbols and those defined at the beginning of the chapter are:

$p = p_0 + p_1 + \cdots + p_k$

$X = [X_0 | X_1 | \cdots | X_k]$

$B = (\beta_0', \beta_1', \ldots, \beta_k')'$


Model


The mixed model is represented, following Rao (1973), as

$$Y = X_0\beta_0 + \sum_{i=1}^{k} X_i\beta_i + e$$

The random vectors $\beta_1, \ldots, \beta_k$ and e are assumed to be jointly independent. Moreover, the
random vector $\beta_i$ is distributed as $N_{p_i}(0, \sigma_i^2 I_{p_i})$ for i=1,...,k and the residual vector e is distributed
as $N_n(0, \sigma_{k+1}^2 W^{-1})$. Thus,

$$\mathrm{cov}(Y) = \sum_{i=1}^{k} \sigma_i^2 X_i X_i' + \sigma_{k+1}^2 W^{-1}$$


Expected Mean Squares 


For the estimable function L, the expected hypothesis sum of squares is

$$E(SS_L) = E(Y'W^{1/2}A_LW^{1/2}Y) = \beta_0'X_0'W^{1/2}A_LW^{1/2}X_0\beta_0 + \sum_{i=1}^{k} \sigma_i^2\,\mathrm{trace}(X_i'W^{1/2}A_LW^{1/2}X_i) + \sigma_{k+1}^2\,\mathrm{trace}(A_L)$$

where

$$A_L = W^{1/2}XGL'(LGL')^{-1}LGX'W^{1/2}$$

Since $L = LGX'WX$, $\mathrm{trace}(A_L) = s$ and $X'W^{1/2}A_LW^{1/2}X = L'(LGL')^{-1}L$. The matrix
$X'W^{1/2}A_LW^{1/2}X$ can therefore be computed in the following way:

Compute an s×s upper triangular matrix U such that $U'U = LGL'$ by the Cholesky decomposition.

Invert the matrix U to give $U^{-1}$.

Compute $C = L'U^{-1}$.

Now we have $X'W^{1/2}A_LW^{1/2}X = CC'$. If the rows of C are partitioned into the same-size
submatrices as those contained in X, that is,

$$C = \begin{pmatrix} C_0 \\ C_1 \\ \vdots \\ C_k \end{pmatrix}$$

where $C_i$ is a $p_i \times s$ submatrix, then $X_i'W^{1/2}A_LW^{1/2}X_i = C_iC_i'$, i=0,1,...,k.
Since $\mathrm{trace}(C_iC_i')$ is equal to the sum of squares of the elements in $C_i$, denoted by $SSQ(C_i)$,
the matrices $C_iC_i'$ need not be formed. The preferred computational formula for the expected
sum of squares is

$$E(SS_L) = \beta_0'C_0C_0'\beta_0 + \sum_{i=1}^{k} \sigma_i^2\,SSQ(C_i) + s\,\sigma_{k+1}^2$$

Finally the expected mean square is

$$E(MS_L) = \frac{1}{s}E(SS_L)$$

For the residual term, the expected residual mean square is: $E(MSE) = \sigma_{k+1}^2$.


Note: GLM does not compute the term $\beta_0'C_0C_0'\beta_0$, but reports the fixed effects whose
corresponding row block in $C_0$ contains nonzero elements.
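
The three computational steps above can be sketched in pure Python. Rather than inverting U explicitly, this sketch solves $CU = L'$ row by row, which gives the same C; the function names `cholesky_upper` and `ems_coefficients` and the argument `block_sizes` (the values $p_0, p_1, \ldots, p_k$) are assumptions made for illustration.

```python
import math


def cholesky_upper(a):
    """Return upper triangular U with U'U = a (a symmetric positive definite)."""
    n = len(a)
    u = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            s = a[i][j] - sum(u[k][i] * u[k][j] for k in range(i))
            u[i][j] = math.sqrt(s) if i == j else s / u[i][i]
    return u


def ems_coefficients(l_mat, g, block_sizes):
    """Return SSQ(C_i) for each row block of C = L' U^{-1}, where U'U = L G L'."""
    s, p = len(l_mat), len(l_mat[0])
    lg = [[sum(l_mat[a][k] * g[k][b] for k in range(p)) for b in range(p)]
          for a in range(s)]
    lgl = [[sum(lg[a][k] * l_mat[b][k] for k in range(p)) for b in range(s)]
           for a in range(s)]
    u = cholesky_upper(lgl)
    c = []
    for row in range(p):
        # Solve (row of C) U = (row of L') by forward substitution.
        b = [l_mat[a][row] for a in range(s)]
        x = [0.0] * s
        for j in range(s):
            x[j] = (b[j] - sum(x[k] * u[k][j] for k in range(j))) / u[j][j]
        c.append(x)
    ssq, start = [], 0
    for size in block_sizes:
        ssq.append(sum(v * v for r in c[start:start + size] for v in r))
        start += size
    return ssq


# Toy case: p0 = 1 fixed-effect parameter, one random effect with p1 = 2, G = I.
L = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
G = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ssq = ems_coefficients(L, G, block_sizes=[1, 2])
```

In this toy case (where $X'WX$ is taken to be the identity) the block sums of squares add up to s = 2, the rank of L.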


Hypothesis Test in Mixed Models 


Suppose $MS_L$ is the mean square for the effect whose estimable function is L, and $s_L$ is the
associated degrees of freedom. The F statistic for testing this effect is

$$F = \frac{MS_L}{MS_{E(L)}}$$

where $MS_{E(L)}$ is the mean square of the error term with $s_{E(L)}$ degrees of freedom.


Null Hypothesis Expected Mean Squares 

If the effect being tested is a fixed effect, its expected mean square is

$$E(MS_L) = \sigma_{k+1}^2 + c_1\sigma_1^2 + \cdots + c_k\sigma_k^2 + Q(L)$$

where $c_1, \ldots, c_k$ are coefficients and $Q(L)$ is a quadratic term involving the fixed effects. Under
the null hypothesis, it is assumed that $Q(L) = 0$. Although the quadratic term may involve effects
that are unrelated to the effect being tested, such effects are assumed to be zero in order to draw a
correct inference for the effect being tested. Therefore, under the null hypothesis, the expected
mean square is

$$E(MS_L) = \sigma_{k+1}^2 + c_1\sigma_1^2 + \cdots + c_k\sigma_k^2$$

If the effect being tested is a random effect, say the jth $(1 \le j \le k)$ random effect, its expected
mean square is

$$E(MS_L) = \sigma_{k+1}^2 + c_1\sigma_1^2 + \cdots + c_k\sigma_k^2$$

Under the null hypothesis $\sigma_j^2 = 0$; hence, the expected mean square is

$$E(MS_L) = \sigma_{k+1}^2 + \sum_{1 \le i \le k,\ i \ne j} c_i\sigma_i^2$$


Error Mean Squares 


Let $MS_i$ be the mean square of the ith (i=1,...,k) random effect. Let $s_i$ be the corresponding
degrees of freedom. The error term is then found as a linear combination of the expected mean
squares of the random effects:

$$MS_{E(L)} = q_1MS_1 + \cdots + q_kMS_k + q_{k+1}MSE$$

such that

$$E(MS_{E(L)}) = q_1E(MS_1) + \cdots + q_kE(MS_k) + q_{k+1}E(MSE) = \sigma_{k+1}^2 + c_1\sigma_1^2 + \cdots + c_k\sigma_k^2$$

If $s_i = 0$ $(1 \le i \le k)$ then $q_i = 0$.
The error degrees of freedom is computed using the Satterthwaite (1946) method:

$$s_{E(L)} = \frac{\left(MS_{E(L)}\right)^2}{\sum_{1 \le i \le k+1;\ s_i > 0} (q_iMS_i)^2 / s_i}$$

where the (k+1)th term refers to the residual mean square MSE and its degrees of freedom.


If the design is balanced, the above F statistic is approximately distributed as an F distribution
with degrees of freedom $(s_L, s_{E(L)})$ under the null hypothesis. The statistic is exact when only
one random effect is used as the error term, that is, $q_{i_0} = 1$ and $q_i = 0$ for $i \ne i_0$. If the design is
not balanced, the above approximation may not be valid (even when only one random effect is
used as the error term) because the hypothesis term and the error term may not be independent.


Repeated Measures 


The GLM (general linear model) procedure provides analysis of variance when the same 
measurement or measurements are made several times on each subject or case (repeated 
measures). The algorithms in this section apply solely to repeated measures designs. 


Notation 


The notation used in “GLM Algorithms” is used here. Additional conventions are defined below: 


$t$ The number of within-subjects factors.

$c$ The number of measures.

$r_k$ The number of levels of the kth within-subjects factor, $r_k \ge 2$, k=1,...,t.

$M_k$ The contrast matrix of the kth within-subjects factor, k=1,...,t. It is
a square matrix with dimension $r_k$. Each element in the first column is
usually equal to $1/r_k$. For a polynomial contrast each element is $1/\sqrt{r_k}$, or,
for a user-specified contrast, a non-zero constant. The other columns have
zero column sums.


Number of Variables 


It is required that $c \times \prod_{k=1}^{t} r_k = r$, the number of dependent variables in the model.


Covariance Structure 


As usual in GLM, the data matrix is related to the parameter matrix B as $Y = XB + E$. The rows
of E are uncorrelated and the ith row has the distribution $N_r(0, w_i^{-1}\Sigma)$. Repeated measures
analysis has two additional assumptions:


- $\Sigma = \Sigma_C \otimes \Sigma_1 \otimes \cdots \otimes \Sigma_t$, where $\Sigma_C$ is the covariance matrix of the measures and $\otimes$ is the
Kronecker product operator.

- The Huynh and Feldt (1970) condition: Suppose $\sigma_{rs}^{(k)}$ is the (r,s)th element of
$\Sigma_k$ (k=1,...,t); then $\sigma_{rr}^{(k)} + \sigma_{ss}^{(k)} - 2\sigma_{rs}^{(k)} = \text{constant}$ for $r \ne s$. Matrices satisfying this
condition result in orthonormally transformed variables with spherical covariance matrices;
for this reason, the assumption is sometimes referred to as the sphericity assumption. A
matrix that has the property of compound symmetry (that is, identical diagonal elements and
identical off-diagonal elements) automatically satisfies this assumption.


Tests on the Between-Subjects Effects 
The procedure for testing the hypothesis of no between-subjects effects uses the following steps: 


1. Compute $M = I_c \otimes M_{1;1} \otimes \cdots \otimes M_{t;1}$, where $M_{k;1}$ is the first column of the contrast matrix $M_k$
of the kth within-subjects factor. Note that M is an r×c matrix.


2. For each of the between-subjects effects including the intercept, get the L matrix, according to 
the specified type of sum of squares. 


3. Compute $S_H = (L\hat{B}M)'(LGL')^{-1}(L\hat{B}M)$ and $S_E = M'SM$. Both are c×c matrices.


4. Compute the four multivariate test statistics: Wilks’ lambda, Pillai’s trace, Hotelling-Lawley 
trace, Roy’s largest root, and the corresponding significance levels. Also compute the individual 
univariate F statistics. 


5. Repeat steps 2 to 4 until all between-subjects effects have been tested.


Multivariate Tests on the Within-Subjects Effects


The procedure for testing the hypothesis of no within-subjects effects uses the following steps: 


1. For the kth within-subjects factor, compute $M = I_c \otimes A_1 \otimes \cdots \otimes A_t$, where $A_k = M_{k;2:r_k}$, which
consists of the second through last columns of $M_k$, when the kth within-subjects factor is involved in the effect.
Otherwise, $A_k = M_{k;1}$. Note that M is an r×cd matrix, where d is the number of columns in the
Kronecker product $A_1 \otimes \cdots \otimes A_t$. In general, d > 1.


2. For each of the between-subjects effects, get the L matrix, according to the specified type of 
sum of squares. 


3. Compute $S_H = (L\hat{B}M)'(LGL')^{-1}(L\hat{B}M)$ and $S_E = M'SM$. Both are cd×cd matrices.


4. Compute the four multivariate test statistics: Wilks’ lambda, Pillai’s trace, Hotelling-Lawley 
trace, Roy’s largest root, and the corresponding significance levels. Also compute the individual 
univariate F statistics. 


5. Repeat steps 2 to 4 for the next between-subjects effect. When all the between-subjects effects 
are used, go to step 6. 


6. Repeat steps 1 to 5 until all within-subjects effects have been tested.


Averaged Tests on the Within-Subjects Effects 


The procedure for the averaged test of the hypothesis of no within-subjects effects uses the 
following steps: 


1. Take $M_k$ (k=1,...,t) to be the equally spaced polynomial contrast matrix.


2. Compute $M = I_c \otimes A_1 \otimes \cdots \otimes A_t$, where $A_k = M_{k;2:r_k}$, which consists of the second through last columns of $M_k$,
when the kth within-subjects factor is involved in the effect. Otherwise, $A_k = 1_{r_k}/\sqrt{r_k}$. Note that
M is an r×cd matrix, where d is the number of columns in the Kronecker product $A_1 \otimes \cdots \otimes A_t$.
In general, d > 1.


3. For each of the between-subjects effects, get the L matrix, according to the specified type of 
sum of squares. 


4. Compute $S_H = (L\hat{B}M)'(LGL')^{-1}(L\hat{B}M)$ and $S_E = M'SM$. Both are cd×cd matrices.


5. Partition $S_H$ into $c^2$ block matrices each of dimension d×d. The (k,l)th block, denoted as $S_{H;k,l}$
(k=1,...,c and l=1,...,c), is a sub-matrix of $S_H$ from row (k − 1)d + 1 to row kd, and from column
(l − 1)d + 1 to column ld. Form the c×c matrix, denoted by $\bar{S}_H$, whose (k,l)th element is the trace
of $S_{H;k,l}$. The matrix $\bar{S}_E$ is obtained similarly.


6. Use $\bar{S}_H$ and $\bar{S}_E$ for computing the four multivariate test statistics: Wilks’ lambda, Pillai’s trace,
Hotelling-Lawley trace, Roy’s largest root, and the corresponding significance levels. Note: Set
the degrees of freedom for $\bar{S}_H$ (same as the row dimension of L in the test procedure) equal to $d\,r_L$,
and that for $\bar{S}_E$ equal to $d(n - r_x)$ in the computations. Also compute the individual univariate F
statistics and their significance levels.


7. Repeat steps 3 to 6 for each between-subjects effect. When all the between-subjects effects
are used, go to step 8.


8. Repeat steps 2 to 7 until all within-subjects effects have been tested. 


Adjustments to Degrees of Freedom of the F Statistics 


The adjustments to degrees of freedom of the univariate F test statistics are the Greenhouse-Geisser 
epsilon, the Huynh-Feldt epsilon, and the lower-bound epsilon. 


For any of the three epsilons, the adjusted significance level is 

$$1 - \text{CDF.F}(F;\ \varepsilon\, d\, r_L,\ \varepsilon\, d(n - r_x))$$

where $\varepsilon$ is one of the three epsilons.


Greenhouse-Geisser epsilon 


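$$\varepsilon_{GG} = \frac{\left(\mathrm{trace}(S_E)\right)^2}{d \times \mathrm{trace}(S_E^2)}$$


Huynh-Feldt epsilon


$$\varepsilon_{HF} = \min\left(\frac{n d\, \varepsilon_{GG} - 2}{d(n - r_x) - d^2 \varepsilon_{GG}},\ 1\right)$$


Lower bound epsilon


$$\varepsilon_{LB} = 1/d$$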
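

The three epsilons can be sketched for a d×d error matrix as follows. The function name `epsilons` is hypothetical, and the sketch assumes the matrix passed in is the d×d error SSCP matrix of the within-subjects effect being tested.

```python
def epsilons(s_e, n, r_x):
    """Return (Greenhouse-Geisser, Huynh-Feldt, lower-bound) epsilons for a d x d matrix."""
    d = len(s_e)
    tr = sum(s_e[i][i] for i in range(d))
    # trace(S_E^2) is the sum over i, j of s_ij * s_ji.
    tr_sq = sum(s_e[i][j] * s_e[j][i] for i in range(d) for j in range(d))
    gg = tr * tr / (d * tr_sq)
    hf = min((n * d * gg - 2) / (d * (n - r_x) - d * d * gg), 1.0)
    lb = 1.0 / d
    return gg, hf, lb


# A spherical error matrix gives epsilon = 1; a rank-one matrix gives the lower bound.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rank_one = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
gg_sph, hf_sph, lb = epsilons(identity, n=10, r_x=1)
gg_low, hf_low, _ = epsilons(rank_one, n=10, r_x=1)
```

The two extreme inputs illustrate the range of the Greenhouse-Geisser epsilon, which lies between the lower bound 1/d and 1.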


Mauchly’s Test of Sphericity 


Mauchly’s test of sphericity is displayed for every repeated measures model. 


Hypotheses 


In Mauchly’s test of sphericity the null hypothesis is $H_0: M'\Sigma M = \sigma^2 I_m$, versus the alternative
hypothesis $H_1: M'\Sigma M \ne \sigma^2 I_m$, where $\sigma^2 > 0$ is unspecified, $I_m$ is an m×m identity matrix, and
M is the r×m orthonormal matrix associated with a within-subjects effect. M is generated using
equally spaced polynomial contrasts applied to the within-subjects factors (see the descriptions in
“Averaged Tests on the Within-Subjects Effects”).


Mauchly’s W Statistic 


$$W = \begin{cases} \dfrac{|\Xi|}{(\mathrm{trace}(\Xi)/m)^m} & \text{if } \mathrm{trace}(\Xi) > 0 \\ \text{SYSMIS} & \text{if } \mathrm{trace}(\Xi) \le 0 \end{cases}$$

where $\Xi = M'AM$ and $A = (Y - X\hat{B})'W(Y - X\hat{B})$ is the r×r matrix of residual sums of
squares and cross products.


Chi-Square Approximation


When n is large and under the null hypothesis, for $n - r_x \ge 1$ and $m \ge 2$,

$$\Pr(-\rho(n - r_x)\log W \le c) = \Pr(\chi_f^2 \le c) + \omega_2\left(\Pr(\chi_{f+4}^2 \le c) - \Pr(\chi_f^2 \le c)\right) + O(n^{-3})$$

where

$$f = m(m + 1)/2 - 1$$

$$\rho = 1 - (2m^2 + m + 2)/(6m(n - r_x))$$

$$\omega_2 = \frac{(m + 2)(m - 1)(m - 2)(2m^3 + 6m^2 + 3m + 2)}{288\, m^2 (n - r_x)^2 \rho^2}$$


Chi-Square Statistic


$$c = \begin{cases} -\rho(n - r_x)\log W & \text{if } W > 0 \\ \text{SYSMIS} & \text{otherwise} \end{cases}$$


Degrees of Freedom


$$f = m(m + 1)/2 - 1$$


Significance


$$1 - \text{CDF.CHISQ}(c, f) - \omega_2\left(\text{CDF.CHISQ}(c, f + 4) - \text{CDF.CHISQ}(c, f)\right)$$

where CDF.CHISQ is the IBM® SPSS® Statistics function for the cumulative chi-square distribution.


The significance is reset to zero whenever the computed value is less than zero due to floating
point imprecision.


HILOGLINEAR Algorithms


HILOGLINEAR fits hierarchical loglinear models to multidimensional contingency tables using an 
iterative proportional-fitting algorithm. 


The Model Minimum Configuration 


Consider an I × J × K table. Let $n_{ijk}$ be the observed frequency and $m_{ijk}$ the expected frequency
for cell (i, j, k). A simple way to construct a saturated linear model in the natural logarithms of the
expected cell frequencies is by analogy with analysis of variance (ANOVA) models:

$$L_{ijk} = \log(m_{ijk}) = u + u_i^A + u_j^B + u_k^C + u_{ij}^{AB} + u_{ik}^{AC} + u_{jk}^{BC} + u_{ijk}^{ABC}$$

where $1 \le i \le I$, $1 \le j \le J$, and $1 \le k \le K$. In general, each of the seven subscripted u-terms
sums to zero over each lettered subscript.


It can be shown (Bishop, Fienberg, and Holland, 1975, p. 65) that, under the commonly
encountered sampling plans, the log-likelihood is

$$\Phi + Nu + \sum_i n_{i++}u_i^A + \sum_j n_{+j+}u_j^B + \sum_k n_{++k}u_k^C + \sum_{i,j} n_{ij+}u_{ij}^{AB} + \sum_{i,k} n_{i+k}u_{ik}^{AC} + \sum_{j,k} n_{+jk}u_{jk}^{BC} + \sum_{i,j,k} n_{ijk}u_{ijk}^{ABC}$$

where $\Phi$ is independent of any parameters and N is the total number of observations. Also, the n-terms
adjacent to the unknown parameters are the sufficient statistics. The formulation of the above log
likelihood is based on the saturated model. When we consider unsaturated models, terms drop
out and those that remain give the sufficient statistics. For instance, if we assume that there is no
three-factor effect, that is, $u_{ijk}^{ABC} = 0$ for all i, j, and k, or more briefly $u^{ABC} = 0$, then

$$\log(m_{ijk}) = u + u_i^A + u_j^B + u_k^C + u_{ij}^{AB} + u_{ik}^{AC} + u_{jk}^{BC}$$

and $n_{i++}$, $n_{+j+}$, $n_{++k}$, $n_{ij+}$, $n_{i+k}$, and $n_{+jk}$ are the sufficient statistics for this reduced model.


These statistics can be considered as tables of sums, called configurations, and denoted by C with proper
subscripts. For example, $\{n_{ij+}\}$ is the configuration $C_{12}$ and $\{n_{+jk}\}$ is the configuration $C_{23}$.
Note that $\{n_{i++}\}$, $\{n_{+j+}\}$, and $\{n_{++k}\}$ can be obtained from $\{n_{ij+}\}$, $\{n_{i+k}\}$ and $\{n_{+jk}\}$. We then
call the last three configurations $C_{12}$, $C_{13}$ and $C_{23}$ minimal configurations or minimal statistics.


Notation for Unlimited Number of Dimensions 


To generalize results, we denote the complete set of subscripts by a single symbol θ. Thus, $n_\theta$ is
the observed frequency in an elementary cell and $w_\theta$ is the cell weight. We add a subscript to
θ to denote a reduced dimensionality so that $n_{\theta_i}$ is the observed sum in a cell of $C_{\theta_i}$. We use the
second subscript, i, solely to distinguish between different configurations.


Iterative Proportional Fitting Procedure (IPFP)


We can obtain MLEs for the elementary cells under any hierarchical model by iterative fitting of 
the minimal sufficient configurations. To illustrate the algorithm, we consider the unsaturated 
model. The MLEs must fit the configurations $C_{12}$, $C_{13}$, and $C_{23}$. The basic IPFP chooses an
initial table $\{\hat m_{ijk}^{(0)}\}$ and then sequentially adjusts the preliminary estimates to fit $C_{12}$, $C_{13}$, and $C_{23}$.
Fitting to $C_{12}$ gives

$$ \hat m_{ijk}^{(1)} = \hat m_{ijk}^{(0)} \frac{n_{ij+}}{\hat m_{ij+}^{(0)}} $$

Subsequent fitting to $C_{13}$ gives

$$ \hat m_{ijk}^{(2)} = \hat m_{ijk}^{(1)} \frac{n_{i+k}}{\hat m_{i+k}^{(1)}} $$

and similarly, after fitting $C_{23}$ we have

$$ \hat m_{ijk}^{(3)} = \hat m_{ijk}^{(2)} \frac{n_{+jk}}{\hat m_{+jk}^{(2)}} $$

We repeat this three-step cycle until convergence to the desired accuracy is attained. The extension
of the above procedure to the general procedure for fitting s configurations is straightforward. Let
the minimal configurations be $C_{\theta_i}$ for i = 1, ..., s, with cell entries $n_{\theta_i}$, respectively. The procedure is
as follows:


Initial Cell Estimates 


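To start the iterations, if CWEIGHT is not specified, set

$$ \hat m_\theta^{(0)} = \begin{cases} 1 & \text{if } n_\theta > 0 \\ 0 & \text{otherwise} \end{cases} $$

If CWEIGHT is specified, set

$$ \hat m_\theta^{(0)} = \begin{cases} 1 & \text{if CWEIGHT} > 1 \\ 0 & \text{if CWEIGHT} \le 0 \\ \text{CWEIGHT} & \text{if } 0 < \text{CWEIGHT} \le 1 \end{cases} $$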


Intermediate Computations 


After obtaining the initial cell estimates, the algorithm proceeds to fit each of these configurations 
in turn. After r cycles, the relations are 


$$ \hat m_\theta^{(sr+i)} = \hat m_\theta^{(sr+i-1)} \frac{n_{\theta_i}}{\hat m_{\theta_i}^{(sr+i-1)}} \quad \text{for } 1 \le i \le s,\ r \ge 0 $$


Convergence Criteria 


The computations stop either when a complete cycle, which consists of s steps, does not cause any
cell to change by more than a preset amount ε, that is,

$$ \left| \hat m_\theta^{(sr)} - \hat m_\theta^{(sr-s)} \right| < \varepsilon \quad \text{for all } \theta $$


or the number of cycles is larger than a preset integer max. Both ε and max can be specified. The
default for ε is


$$ \varepsilon = \max_\theta \{ 0.25,\ 10^{-3}\, n_\theta \} $$


and the default for max is 20. 
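
The three-step cycle described above can be sketched in plain Python for a three-way table under the model with no three-factor effect. This is only an illustration under stated assumptions: the function name `ipf_no_three_way`, the nested-list table layout, and the default arguments are hypothetical and are not part of HILOGLINEAR; CWEIGHT handling is omitted.

```python
from itertools import product


def ipf_no_three_way(n, eps=0.25, max_cycles=20):
    """Fit the two-factor configurations C12, C13, C23 to a table n[i][j][k]."""
    I, J, K = len(n), len(n[0]), len(n[0][0])
    cells = list(product(range(I), range(J), range(K)))
    # Initial estimates: 1 where the observed count is positive, 0 otherwise.
    m = [[[1.0 if n[i][j][k] > 0 else 0.0 for k in range(K)]
          for j in range(J)] for i in range(I)]

    def margin(table, keep):
        # Sum the table over every axis that is not listed in `keep`.
        out = {}
        for c in cells:
            key = tuple(c[a] for a in keep)
            out[key] = out.get(key, 0.0) + table[c[0]][c[1]][c[2]]
        return out

    cycle = 0
    for cycle in range(1, max_cycles + 1):
        old = [[row[:] for row in plane] for plane in m]
        for keep in ((0, 1), (0, 2), (1, 2)):  # C12, then C13, then C23
            obs, fit = margin(n, keep), margin(m, keep)
            for c in cells:
                key = tuple(c[a] for a in keep)
                if fit[key] > 0:
                    m[c[0]][c[1]][c[2]] *= obs[key] / fit[key]
        # Stop when no cell changed by eps or more over a complete cycle.
        change = max(abs(m[i][j][k] - old[i][j][k]) for i, j, k in cells)
        if change < eps:
            break
    return m, cycle
```

Calling `ipf_no_three_way(table, eps=1e-9, max_cycles=500)` on a small table returns fitted counts whose two-way margins agree with the observed ones to within the tolerance.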


Goodness of Fit Tests 


The Pearson chi-square statistic is

$$ \chi^2 = \sum_\theta \frac{(n_\theta - \hat m_\theta)^2}{\hat m_\theta} $$

and the likelihood-ratio chi-square statistic is

$$ L^2 = 2 \sum_\theta n_\theta \ln(n_\theta / \hat m_\theta) $$


where the first summation is done over the cells with nonzero estimated cell frequencies while the 
second summation is done over cells with positive observed and estimated cell frequencies. The 
degrees of freedom for the above two statistics are computed as follows: 


Adjusted Degrees of Freedom 


Let $T_c$ be the total number of cells and P the number of parameters in the model. Also, let
$z_c$ be the number of cells such that $\hat m_\theta = 0$. The adjusted degrees of freedom is

$$ \text{adjusted df} = T_c - P - z_c $$


Unadjusted Degrees of Freedom

$$ \text{unadjusted df} = T_c - P $$
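
As a rough illustration of the two chi-square statistics and the two degrees-of-freedom rules, the following Python sketch works on flat lists of observed and fitted cell counts. The function name `fit_statistics`, its argument names, and the dictionary keys are assumptions made for this example only.

```python
import math


def fit_statistics(observed, expected, n_params):
    """Pearson and likelihood-ratio chi-squares with adjusted/unadjusted df."""
    # Pearson: sum over cells with nonzero estimated frequency.
    pearson = sum((n - m) ** 2 / m
                  for n, m in zip(observed, expected) if m > 0)
    # Likelihood ratio: sum over cells with positive observed and estimated counts.
    lr = 2.0 * sum(n * math.log(n / m)
                   for n, m in zip(observed, expected) if n > 0 and m > 0)
    total_cells = len(observed)
    zero_cells = sum(1 for m in expected if m == 0)
    return {
        "pearson": pearson,
        "lr": lr,
        "df_unadjusted": total_cells - n_params,
        "df_adjusted": total_cells - n_params - zero_cells,
    }
```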


Parameter Estimates and Standard Errors 


If a saturated model is fitted and neither $n_\theta + \delta$ nor $w_\theta$ is equal to zero for all cells, then the
parameter estimates and their standard errors will be computed. Each estimate of the parameters
in the saturated model can be expressed as a linear combination of the logarithms of the observed
cell frequencies plus user-specified δ, where the coefficients used in the linear combination
add to zero. We discuss the rule for obtaining the coefficients. Consider, in the general case, a
$J_1 \times J_2 \times \dots \times J_M$ frequency table with defining variables $X_1, \dots, X_M$. Let $u^{X_{i_1} \cdots X_{i_L}}_{j_{i_1} \cdots j_{i_L}}$ denote
an L-term interaction involving $X_{i_1}, \dots, X_{i_L}$ at levels $j_{i_1}, \dots, j_{i_L}$, respectively. Denote A as a
vector that is constructed in the way that its nonzero components correspond to the variables in the
parameter to be estimated and are set to the level of the variable. Let $C_{j_1,\dots,j_M}$ be an M-dimensional vector
with components equal to cell IDs. That is,

$$ C_{j_1,\dots,j_M} = (j_1, \dots, j_M); \quad 1 \le j_i \le J_i;\ 1 \le i \le M $$

The coefficient $\beta_{j_1,\dots,j_M}$ is determined through the comparison of the components of A and
$C_{j_1,\dots,j_M}$. Let s be the number of nonzero components of A that do not match (equal) the
corresponding components of $C_{j_1,\dots,j_M}$. Also, let matching occur at components $i_1, \dots, i_k$. Then
the coefficient for cell $(j_1, \dots, j_M)$ is

Be coy = (Ey aD ee (G1) 

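The estimate $\hat u^{X_{i_1} \cdots X_{i_L}}_{j_{i_1} \cdots j_{i_L}}$ of $u^{X_{i_1} \cdots X_{i_L}}_{j_{i_1} \cdots j_{i_L}}$ is then

$$ \hat u^{X_{i_1} \cdots X_{i_L}}_{j_{i_1} \cdots j_{i_L}} = \sum_{j_1,\dots,j_M} \beta_{j_1,\dots,j_M} \ln(n_{j_1,\dots,j_M} + \delta) $$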


For a large sample, the estimate approximately follows a normal distribution with the above mean 
and variance if the sampling model follows a Poisson, multinomial, or product-multinomial 
distribution. The confidence interval for the parameter can be computed based on the asymptotic 
normality. 


Residuals 


The following residuals are computed. 


Raw Residuals 


$$ \text{raw residual} = n_\theta - \hat m_\theta $$


Standardized Residuals

$$ \text{standardized residual} = (n_\theta - \hat m_\theta) / \sqrt{\hat m_\theta} $$

where $\hat m_\theta$ must be greater than 0.


Partial Associations and Partial Chi-squares 


Partial associations of effects can be requested when a saturated model is specified. Let $\chi^2(k)$ be
the chi-square for the model that contains the effects up to and including the k-interaction terms.
The test of the significance of the kth-order interaction can be based on

$$ \chi^2(k-1) - \chi^2(k) $$

Degrees of freedom are obtained by subtracting the degrees of freedom for the corresponding
models.


Model Selection Via Backward Elimination 


The selection process starts with the model specified (either via DESIGN or MAXORDER 
subcommand). The partial chi-square is calculated for every term in the generating class. Any 
term with zero partial chi-square is deleted, then the effect with the largest observed significance 
level for the change in chi-square is deleted, provided the significance level is larger than 0.05, 
the default. With the removal of a highest-order term, a new model with new generating class is 
generated. The above process of removing a term is repeated for the new model and is continued 
until no remaining terms in the model can be deleted.


HOMALS Algorithms


The iterative HOMALS algorithm is a modernized version of Guttman (1941). The treatment of 
missing values, described below, is based on setting weights in the loss function equal to zero, and 
was first described in De Leeuw and Van Rijckevorsel (1980). Other possibilities do exist and can 
be accomplished by recoding the data (Gifi, 1981; Meulman, 1982). 


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


n          Number of cases (objects)
m          Number of variables
p          Number of dimensions

For variable j; j = 1, ..., m:

$h_j$        n-vector with categorical observations
$k_j$        Number of valid categories (distinct values) of variable j
$G_j$        Indicator matrix for variable j, of order $n \times k_j$, with elements
           $g_{(j)ir} = 1$ when the ith object is in the rth category of variable j
           $g_{(j)ir} = 0$ when the ith object is not in the rth category of variable j
$D_j$        Diagonal matrix, containing the univariate marginals; that is, the column
           sums of $G_j$
$M_j$        Binary diagonal $n \times n$ matrix, with diagonal elements defined as
           $m_{(j)ii} = 1$ when the ith observation is within the range $[1, k_j]$
           $m_{(j)ii} = 0$ when the ith observation is outside the range $[1, k_j]$


The quantification matrices and parameter vectors are: 


X          Object scores, of order $n \times p$
$Y_j$        Category quantifications, of order $k_j \times p$
Y          Concatenated category quantification matrices, of order $(\sum_j k_j) \times p$


Note: The matrices $G_j$, $M_j$, and $D_j$ are exclusively notational devices; they are stored in reduced
form, and the program fully profits from their sparseness by replacing matrix multiplications
with selective accumulation.


Objective Function Optimization 


The HOMALS objective is to find object scores X and a set of $Y_j$ (for j = 1, ..., m) so that
the function

$$ \sigma(X; Y) = \frac{1}{m} \sum_j \mathrm{tr}\left( (X - G_j Y_j)' M_j (X - G_j Y_j) \right) $$

is minimal, under the normalization restriction $X' M_* X = mnI$, where the matrix $M_* = \sum_j M_j$,
and I is the $p \times p$ identity matrix. The inclusion of $M_j$ in $\sigma(X; Y)$ ensures that there is no influence
of data values outside the range $[1, k_j]$, which may be really missing or merely regarded as such;
$M_*$ contains the number of “active” data values for each object. The object scores are also
centered; that is, they satisfy $u' M_* X = 0$, with u denoting an n-vector with ones.


Optimization is achieved through the following iteration scheme: 
1. Initialization
2. Update object scores
3. Orthonormalization
4. Update category quantifications
5. Convergence test: repeat steps 2-4 or continue
6. Rotation


These steps are explained below. 


Initialization 
The object scores X are initialized with random numbers, which are normalized so that
$u' M_* X = 0$ and $X' M_* X = mnI$, yielding $\tilde X$. Then the first category quantifications are obtained
as $\tilde Y_j = D_j^{-1} G_j' \tilde X$.
Update object scores
First the auxiliary score matrix Z is computed as
$$ Z \leftarrow \sum_j M_j G_j \tilde Y_j $$
and centered with respect to $M_*$:
$$ \tilde Z \leftarrow \{ M_* - (M_* u u' M_* / u' M_* u) \} Z $$


These two steps yield locally the best updates when there are no orthogonality constraints. 


Orthonormalization 


The orthonormalization problem is to find an $M_*$-orthonormal $X^+$ that is closest to $\tilde Z$ in the least
squares sense. In HOMALS, this is done by setting

$$ X^+ \leftarrow m^{1/2} M_*^{-1/2}\, \mathrm{GRAM}(M_*^{-1/2} \tilde Z) $$

which is equal to the genuine least squares estimate up to a rotation. The notation GRAM( ) is
used to denote the Gram-Schmidt transformation (Björck and Golub, 1973).


Update category quantifications 


For j = 1, ..., m, the new category quantifications are computed as:

$$ Y_j^+ = D_j^{-1} G_j' X^+ $$


Convergence test 


The difference between consecutive loss function values $\sigma(\tilde X; \tilde Y) - \sigma(X^+; Y^+)$ is compared with
the user-specified convergence criterion ε, a small positive number. Steps 2 to 4 are repeated as
long as the loss difference exceeds ε.


Rotation 


As indicated in step 3, during iteration the orientation of X and Y with respect to the coordinate
system is not necessarily correct; this also reflects that $\sigma(X; Y)$ is invariant under simultaneous
rotations of X and Y. From theory it is known that solutions in different dimensionality should
be nested; that is, the p-dimensional solution should be equal to the first p columns of the
(p+1)-dimensional solution. Nestedness is achieved by computing the eigenvectors of the matrix
$\frac{1}{m} \sum_j Y_j' D_j Y_j$. The corresponding eigenvalues are printed after the convergence message
of the program. The calculation involves tridiagonalization with Householder transformations
followed by the implicit QL algorithm (Wilkinson, 1965).


Diagnostics 


The following diagnostics are available. 


Maximum Rank (may be issued as a warning when exceeded) 


The maximum rank $p_{\max}$ indicates the maximum number of dimensions that can be computed
for any dataset. In general:

$$ p_{\max} = \min\left\{ (n - 1),\ \left( \left( \sum_j k_j \right) - \max(m_1, 1) \right) \right\} $$

where $m_1$ is the number of variables with no missing values. Although the number of nontrivial
dimensions may be less than $p_{\max}$ when m = 2, HOMALS does allow dimensionalities all the
way up to $p_{\max}$.


Marginal Frequencies

The frequencies table gives the univariate marginals and the number of missing values (that is,
values that are regarded as out of range for the current analysis) for each variable. These are
computed as the column sums of $D_j$ and the total sum of $M_j$.


Discrimination Measure 
These are the dimensionwise variances of the quantified variables. For variable j and dimension s:

$$ \eta_{js}^2 = y_{(j)s}' D_j\, y_{(j)s} / n $$

where $y_{(j)s}$ is the sth column of $Y_j$, corresponding to the sth quantified variable $G_j y_{(j)s}$.


Eigenvalues 


The computation of the eigenvalues that are reported after convergence is discussed in step 6. With
the HISTORY option, the sum of the eigenvalues is reported during iteration under the heading
“total fit.” Because the sum of the eigenvalues is equal to the trace of the original matrix,
the sum can be computed as $\frac{1}{m} \sum_j \sum_s \eta_{js}^2$. The value of $\sigma(X; Y)$ is equal to $p - \frac{1}{m} \sum_j \sum_s \eta_{js}^2$.


This procedure estimates the survival function for time to occurrence of an event. Some of the
times may be “censored” in that the event does not occur during the observation period, or contact 
is lost with participants (loss to follow-up). 

If the subjects are divided into treatment groups, KM produces a survival function for each 
treatment group (factor level) and a test of equality of the survival functions across treatment 
groups. The survival functions across treatment groups can also be compared while controlling for 
categories of a stratification variable. 


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


Pp Number of levels (strata) for the stratification variable 
g Number of levels (treatment groups) for the factor variable 


Estimation and SE for Survival Distribution 


Suppose that for a given combination of the stratification and factor variables, a random sample
of n individuals yields a sample with k distinct observed failure times (uncensored). Let
$t_1 < \dots < t_k$ represent the observed life times and $T_n$ be the largest observation in the sample.
(Note that $T_n = t_k$ if the largest observation is uncensored.) Define


$n_i$ = Number of subjects who are at risk at time $t_i$
$d_i$ = Number of failures (deaths) at $t_i$
$\lambda_i$ = Number of censorings in interval $[t_i, t_{i+1})$


Note that

$n_0 = n$

$n_{i+1} = n_i - d_i - \lambda_i$, i = 0, 1, ..., k−1

$t_0 = 0$

$t_{k+1} = \infty$

$d_0 = 0$

Ao = 


Note that

$$ \hat S(t_i^+) = \hat S(t_{i-1}^+) \left( 1 - \frac{d_i}{n_i} \right) $$


Note: When $n_k = d_k$ ($T_n = t_k$ and $\lambda_k = 0$), $\hat S(t_k^+) = 0$ and $\mathrm{var}(\hat S(t_k^+)) = 0$.
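
The recursion over the distinct failure times can be sketched as follows. This is a minimal illustration, not the KM procedure itself: the function name `kaplan_meier` and the tuple layout of the result are hypothetical, and the variance uses Greenwood's formula, which is an assumption here because the variance expression is not shown in this section. Observations censored at a failure time are treated as still at risk at that time.

```python
def kaplan_meier(times, events):
    """Product-limit estimate; events[i] is 1 for a failure, 0 for a censored time."""
    data = list(zip(times, events))
    at_risk = len(data)
    surv, greenwood = 1.0, 0.0
    rows = []
    for t in sorted(set(times)):
        d = sum(1 for tt, e in data if tt == t and e == 1)  # failures at t
        c = sum(1 for tt, e in data if tt == t and e == 0)  # censorings at t
        if d > 0:
            surv *= 1.0 - d / at_risk
            if at_risk > d:
                greenwood += d / (at_risk * (at_risk - d))
                var = surv * surv * greenwood
            else:
                var = 0.0  # everyone still at risk fails: estimate and variance are 0
            rows.append((t, at_risk, d, surv, var))
        at_risk -= d + c
    return rows
```

Each returned row holds the failure time, the number at risk, the number of failures, the survival estimate just after that time, and its estimated variance.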


Estimation of Mean Survival Time and Standard Error 
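$$ \hat\mu = \begin{cases} \sum_{i=0}^{k-1} \hat S(t_i^+)(t_{i+1} - t_i) & \text{if } T_n = t_k \\ \sum_{i=0}^{k-1} \hat S(t_i^+)(t_{i+1} - t_i) + \hat S(t_k^+)(T_n - t_k) & \text{otherwise} \end{cases} $$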


The variance of the mean survival time is 


k 9 
asd; 
var(ft) = ) aaah ear ) 


i=1 : 
k—-1 : 
ai =) > S(ti) (tui - ti) S(t; )(Ti — tk) 
lai 
d= > a; 
i=1 


unless there are both censored and uncensored occurrences of the largest survival time. In that case, 


k-1 2 
d asd; 


zr n(n; — di) 


ay = Sse \(tist —t)) 


l= 


The standard error is the square root of the variance.


Plots


The following plots are available. 


Survival Functions versus Time 


The survival function S(t) is plotted against t. 


Log Survival Functions versus Time 


$\ln(\hat S(t))$ is plotted against t.


Cumulative Hazard Functions versus Time


$-\ln(\hat S(t))$ is plotted against t.


Estimation of Percentiles and Standard Error 


The 100p percentile of the survival time, where p is between 0 and 1, is computed as

$$ t_p = \inf\{ t_i \mid \hat S(t_i) \le p \} $$

The asymptotic variance of $t_p$ is estimated by

$$ \mathrm{var}(t_p) = \frac{\mathrm{var}(\hat S(t_p))}{f(t_p)^2} $$

where $f(t_p)$ is computed as

$$ f(t_p) = \frac{\hat S(u_{p+0.05}) - \hat S(t_{p-0.05})}{t_{p-0.05} - u_{p+0.05}} $$

where $u_a = \sup\{ t_i \mid \hat S(t_i) \ge a \}$.


Testing the Equality of the Survival Functions 


Three statistics are computed to test the equality of survival distributions in the presence of 
arbitrary right censorship. These statistics are the logrank (Mantel-Cox), the modified Wilcoxon 
test statistic (Breslow), and an alternative test statistic proposed by Tarone and Ware (1977). 
Using the regression model proposed by Cox (1972), all three test statistics have been modified 
for testing monotonic trend in hazard functions. 


Test Statistics 
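Let $n^{(s)}$ be the number of subjects in stratum s. Let

$$ t_1^{(s)} < \dots < t_{m_s}^{(s)} $$

be the observed failure times (responses) and

$n_{il}^{(s)}$ = in stratum s, the number of individuals in group l at risk just prior to $t_i^{(s)}$

$d_{il}^{(s)}$ = number of deaths at $t_i^{(s)}$ in group l

and

$$ d_i^{(s)} = \sum_{l=1}^{g} d_{il}^{(s)} $$

$$ n_i^{(s)} = \sum_{l=1}^{g} n_{il}^{(s)} $$

Hence, the expected number of events in group l at time $t_i^{(s)}$ is given by

$$ E_{il}^{(s)} = \frac{d_i^{(s)}\, n_{il}^{(s)}}{n_i^{(s)}} $$

Define

$$ U_s = \left( U_1^{(s)}, \dots, U_{g-1}^{(s)} \right)' $$

with

$$ U_l^{(s)} = \sum_{i=1}^{m_s} w_i^{(s)} \left( d_{il}^{(s)} - E_{il}^{(s)} \right) \quad \text{for } l = 1, \dots, g-1 $$

Also, let $V_s$ be a $(g-1) \times (g-1)$ covariance matrix with

$$ V_{jl}^{(s)} = \sum_{i=1}^{m_s} \left( w_i^{(s)} \right)^2 \frac{d_i^{(s)} \left( n_i^{(s)} - d_i^{(s)} \right)}{n_i^{(s)} - 1} \cdot \frac{n_{ij}^{(s)}}{n_i^{(s)}} \left( \delta_{jl} - \frac{n_{il}^{(s)}}{n_i^{(s)}} \right) \quad \text{for } j, l = 1, \dots, g-1 $$

where

$$ w_i^{(s)} = \begin{cases} 1 & \text{for log-rank test} \\ n_i^{(s)} & \text{for Breslow test} \\ \sqrt{n_i^{(s)}} & \text{for Tarone-Ware test} \end{cases} $$

and

$$ \delta_{jl} = \begin{cases} 1 & \text{if } j = l \\ 0 & \text{otherwise} \end{cases} $$

Define

$$ U = \sum_{s=1}^{p} U_s $$

and

$$ V = \sum_{s=1}^{p} V_s $$

The test statistic for the equality of the g survival functions is defined by

$$ \chi^2 = U' V^{-1} U $$

$\chi^2$ has an asymptotic chi-square distribution with g−1 degrees of freedom.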


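Test Statistic for Trend


Let

$$ t = (t_1, \dots, t_g)' $$

be a vector with $t_l$ = trend weighting coefficient for group l. Form the vector

$$ \tilde U_s = \left( U_1^{(s)}, \dots, U_g^{(s)} \right)' $$

$\tilde U_s$ differs from $U_s$ only in the last component.
Let $\tilde V_s$ be a $g \times g$ matrix with element $V_{jl}^{(s)}$ for $1 \le l, j \le g$. The test statistic is defined by

$$ \chi^2 = \frac{(t' \tilde U)^2}{t' \tilde V t} $$

where

$$ \tilde U = \sum_{s=1}^{p} \tilde U_s $$

$$ \tilde V = \sum_{s=1}^{p} \tilde V_s $$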

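The logrank, Breslow, and Tarone-Ware tests may involve trend. Each of the test statistics has a
chi-square distribution with one degree of freedom.


The default trend is defined as follows:

$$ t = \begin{cases} \left( -(g-1), \dots, -3, -1, 1, 3, \dots, (g-1) \right) & \text{if } g \text{ is even} \\ \left( -\frac{g-1}{2}, \dots, -1, 0, 1, \dots, \frac{g-1}{2} \right) & \text{otherwise} \end{cases} $$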
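
A short Python sketch of the default trend weights, assuming the even/odd rule above (odd integers symmetric about zero for even g, consecutive integers centered at zero otherwise). The function name `default_trend` is hypothetical.

```python
def default_trend(g):
    """Default trend weights for g groups."""
    if g % 2 == 0:
        # -(g-1), ..., -3, -1, 1, 3, ..., (g-1)
        return [2 * l - g - 1 for l in range(1, g + 1)]
    # -(g-1)/2, ..., -1, 0, 1, ..., (g-1)/2
    return [l - (g + 1) // 2 for l in range(1, g + 1)]
```

In both cases the weights sum to zero, so the resulting contrast measures a linear trend across the ordered groups.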
KNN Algorithms


Nearest Neighbor Analysis is a method for classifying cases based on their similarity to other 
cases. In machine learning, it was developed as a way to recognize patterns of data without 
requiring an exact match to any stored patterns, or cases. Similar cases are near each other and 
dissimilar cases are distant from each other. Thus, the distance between two cases is a measure 
of their dissimilarity. 


Cases that are near each other are said to be “neighbors.” When a new case (holdout) is presented, 
its distance from each of the cases in the model is computed. The classifications of the most 
similar cases — the nearest neighbors — are tallied and the new case is placed into the category that 
contains the greatest number of nearest neighbors. 


You can specify the number of nearest neighbors to examine; this value is called k. The pictures 
show how a new case would be classified using two different values of k. When k = 5, the new 
case is placed in category 1 because a majority of the nearest neighbors belong to category 1. 
However, when k = 9, the new case is placed in category 0 because a majority of the nearest 
neighbors belong to category 0. 


Nearest neighbor analysis can also be used to compute values for a continuous target. In this 


situation, the average or median target value of the nearest neighbors is used to obtain the 
predicted value for the new case. 


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


Y                     Optional $1 \times N$ vector of responses with element $y_n$, where n = 1, ..., N
                      indexes the cases.

$X^0$                   $P^0 \times N$ matrix of features with element $x^0_{pn}$, where p = 1, ..., $P^0$ indexes the
                      features and n = 1, ..., N indexes the cases.

X                     $P \times N$ matrix of encoded features with element $x_{pn}$, where p = 1, ..., P
                      indexes the features and n = 1, ..., N indexes the cases.

P                     Dimensionality of the feature space; the number of continuous features
                      plus the number of categories across all categorical features.

N                     Total number of cases.

$N_j$, j = 1, 2, ..., J   The number of cases with Y = j, where Y is a response variable with
                      J categories.

N j The number of cases which belong to class j and are correctly classified 

. as j. 
N : The total number of cases which are classified as j. 
Preprocessing 


Features are coded to account for differences in measurement scale.


Continuous


Continuous features are optionally coded using adjusted normalization:

$$ x_{pn} = \frac{2 \left( x^0_{pn} - \min(x^0_p) \right)}{\max(x^0_p) - \min(x^0_p)} - 1 $$

where $x_{pn}$ is the normalized value of input feature p for case n, $x^0_{pn}$ is the original value of the
feature for case n, $\min(x^0_p)$ is the minimum value of the feature for all training cases, and
$\max(x^0_p)$ is the maximum value for all training cases.


Categorical 


Categorical features are always temporarily recoded using one-of-c coding. If a feature has
c categories, then it is stored as c vectors, with the first category denoted (1,0,...,0), the next
category (0,1,0,...,0), ..., and the final category (0,0,...,0,1).
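
The two codings can be illustrated with a small Python sketch. The function names are hypothetical; the adjusted normalization is assumed to map the training minimum to −1 and the maximum to 1, the handling of a constant feature (returning zeros) is an assumption added only to avoid division by zero, and the category order in `one_of_c` is assumed to be the sorted order of the distinct values.

```python
def adjusted_normalize(values):
    """Rescale a continuous feature to [-1, 1] using its training min and max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # assumption: constant feature carries no information
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


def one_of_c(values):
    """Encode a categorical feature as c indicator columns, one per category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]
```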


Training 


Training a nearest neighbor model involves computing the distances between cases based upon 
their values in the feature set. The nearest neighbors to a given case have the smallest distances 
from that case. The distance metric, choice of number of nearest neighbors, and choice of the 
feature set have the following options. 


Distance Metric 


We use one of the following metrics to measure the similarity of query cases and their nearest 
neighbors. 


Euclidean Distance. The distance between two cases is the square root of the sum, over all
dimensions, of the weighted squared differences between the values for the cases.

$$ \text{Euclidean}_{ih} = \sqrt{ \sum_{p=1}^{P} w_{(p)} \left( x_{(p)i} - x_{(p)h} \right)^2 } $$


City Block Distance. The distance between two cases is the sum, over all dimensions, of the
weighted absolute differences between the values for the cases.

$$ \text{CityBlock}_{ih} = \sum_{p=1}^{P} w_{(p)} \left| x_{(p)i} - x_{(p)h} \right| $$

The feature weight $w_{(p)}$ is equal to 1 when feature importance is not used to weight distances;
otherwise, it is equal to the normalized feature importance:

$$ w_{(p)} = FI_{(p)} \Big/ \sum_{p=1}^{P} FI_{(p)} $$

See “Output Statistics” for the computation of feature importance $FI_{(p)}$.
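
Both metrics reduce to a few lines of Python. In this sketch the function names and the optional `w` argument are illustrative assumptions; omitting `w` corresponds to unit feature weights.

```python
import math


def euclidean(x_i, x_h, w=None):
    """Square root of the weighted sum of squared differences."""
    w = w or [1.0] * len(x_i)
    return math.sqrt(sum(wp * (a - b) ** 2 for wp, a, b in zip(w, x_i, x_h)))


def city_block(x_i, x_h, w=None):
    """Weighted sum of absolute differences."""
    w = w or [1.0] * len(x_i)
    return sum(wp * abs(a - b) for wp, a, b in zip(w, x_i, x_h))


def normalized_importance(fi):
    """Scale feature importances so that they sum to 1."""
    total = sum(fi)
    return [v / total for v in fi]
```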


Crossvalidation for Selection of k 


Cross validation is used for automatic selection of the number of nearest neighbors, between a
minimum $k_{\min}$ and maximum $k_{\max}$. Suppose that the training set has a cross validation variable
with the integer values 1, 2, ..., V. Then the cross validation algorithm is as follows:


> For each $k \in [k_{\min}, k_{\max}]$, compute the average error rate or sum-of-squares error of k:
$CV_k = \sum_{v=1}^{V} e_v / V$, where $e_v$ is the error rate or sum-of-squares error when we apply the Nearest
Neighbor model to make predictions on the cases with X = v; that is, when we use the other
cases as the training dataset.


> Select the optimal k as: $\hat k = \arg\{\min CV_k : k_{\min} \le k \le k_{\max}\}$


Note: If multiple values of k are tied on the lowest average error, we select the smallest k among 
those that are tied. 
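
The selection rule, including the tie-break toward the smallest k, can be sketched as below. The function name `select_k` and the input layout (a dictionary mapping each candidate k to its list of per-fold errors) are assumptions for illustration; computing the fold errors themselves is left to the caller.

```python
def select_k(fold_errors):
    """Return (best k, average errors); ties go to the smallest k."""
    cv = {k: sum(errs) / len(errs) for k, errs in fold_errors.items()}
    best = min(cv.values())
    # Exact equality is used for ties; a tolerance could be substituted if needed.
    return min(k for k, v in cv.items() if v == best), cv
```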


Feature Selection 


Feature selection is based on the wrapper approach of Cunningham and Delany (2007) and uses
forward selection, which starts from $J_{forced}$ features that are entered into the model. Further
features are chosen sequentially; the chosen feature at each step is the one that causes the largest
decrease in the error rate or sum-of-squares error.


Let $S_J$ represent the set of J features that are currently chosen to be included, $S_J^c$ represent the
set of remaining features, and $e_J$ represent the error rate or sum-of-squares error associated
with the model based on $S_J$.


The algorithm is as follows:


> Start with $J = J_{forced}$ features.


> For each feature in $S_J^c$, fit the k nearest neighbor model with this feature plus the existing features
in $S_J$ and calculate the error rate or sum-of-squares error for each model. The feature in $S_J^c$ whose
model has the smallest error rate or sum-of-squares error is the one to be added to create $S_{J+1}$.


> Check the selected stopping criterion. If satisfied, stop and report the chosen feature subset. 
Otherwise, J=J+1 and go back to the previous step. 


Note: the set of encoded features associated with a categorical predictor is considered and added
together as a set for the purpose of feature selection.


Stopping Criteria
One of two stopping criteria can be applied to the feature selection algorithm.


Fixed number of features. The algorithm adds a fixed number of features, $J_{add}$, in addition to
those forced into the model. The final feature subset will have $J_{add} + J_{forced}$ features. $J_{add}$ may
be user-specified or computed automatically; if computed automatically the value is

$$ J_{add} = \max\{ \min(20, P^0) - J_{forced},\ 0 \} $$

When this is the stopping criterion, the feature selection algorithm stops when $J_{add}$ features
have been added to the model; that is, when $J_{add} = J + 1$, stop and report $S_{J+1}$ as the chosen
feature subset.


Note: if $J_{add} = 0$, no features are added and $S_J$ with $J = J_{forced}$ is reported as the chosen
feature subset.


Change in error rate or sum of squares error. The algorithm stops when the change in the
absolute error ratio indicates that the model cannot be further improved by adding more features.
Specifically, if $e_{J+1} = 0$ or $e_J > e_{J+1}$ and

$$ \frac{|e_J - e_{J+1}|}{e_J} < \Delta_{\min} $$

where $\Delta_{\min}$ is the specified minimum change, stop and report $S_{J+1}$ as the chosen feature subset.


If $e_J < e_{J+1}$ and

$$ \frac{|e_J - e_{J+1}|}{e_J} > 2\Delta_{\min} $$

stop and report $S_J$ as the chosen feature subset.
Note: if $e_J = 0$ for $J = J_{forced}$, no features are added and $S_J$ with $J = J_{forced}$ is reported as
the chosen feature subset.
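
The forward selection loop with the fixed-number stopping rule can be sketched as follows. This is an illustration under stated assumptions: `forward_select`, `default_n_add`, and the caller-supplied `error_of` callback (which would fit a nearest neighbor model on a feature subset and return its error) are hypothetical names, and the change-in-error stopping rule is not implemented here.

```python
def default_n_add(p0, j_forced):
    """Automatic number of features to add: max(min(20, P0) - J_forced, 0)."""
    return max(min(20, p0) - j_forced, 0)


def forward_select(candidates, forced, error_of, n_add):
    """Greedily add n_add features, each time choosing the one with the smallest error."""
    chosen = list(forced)
    remaining = [f for f in candidates if f not in chosen]
    for _ in range(n_add):
        if not remaining:
            break
        best = min(remaining, key=lambda f: error_of(chosen + [f]))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

Passing the error function in as a callback keeps the sketch independent of how the nearest neighbor model is fitted and evaluated.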

Combined k and Feature Selection 


The following method is used for combined neighbors and features selection. 
1. For each k, use the forward selection method for feature selection. 


2. Select the k, and accompanying feature set, with the lowest error rate or the lowest sum-of-squares 
error. 


Output Statistics 


The following statistics are available.


Percent correct for class j


J x 100% 
Nj 


Overall percent for class j 


NT 


ra x 100% 


Intersection of Overall percent and percent correct 
Jf 
os N; / N x 100 % 
j=1 
Error rate of classification 


J 
1-5 —N,/N } x 100% 
j=l 


Sum-of-Square Error for continuous response 


$$ \sum_{n=1}^{N} (y_n - \hat y_n)^2 $$

where $\hat y_n$ is the estimated value of $y_n$.


Feature Importance 


Suppose there are $X_{(1)}, X_{(2)}, \dots, X_{(m)}$ ($1 \le m \le P^0$) in the model from the forward selection
process with the error rate or sum-of-squares error e. The importance of feature $X_{(p)}$ in the
model is computed by the following method.


> Delete the feature $X_{(p)}$ from the model, make predictions, and evaluate the error rate or
sum-of-squares error $e_{(p)}$ based on features $X_{(1)}, X_{(2)}, \dots, X_{(p-1)}, X_{(p+1)}, \dots, X_{(m)}$.


> Compute the error ratio $e_{(p)} + \frac{1}{m}$.


The feature importance of $X_{(p)}$ is $FI_{(p)} = e_{(p)} + \frac{1}{m}$.


Scoring 


After we find the k nearest neighbors of a case, we can classify it or predict its response value.


Categorical response


Classify each case by majority vote of its k nearest neighbors among the training cases. 


> If multiple categories are tied on the highest predicted probability, then the tie should be broken by 
choosing the category with largest number of cases in training set. 


> If multiple categories are tied on the largest number of cases in the training set, then choose the 
category with the smallest data value among the tied categories. In this case, categories are 
assumed to be in the ascending sort or lexical order of the data values. 


We can also compute the predicted probability of each category. Suppose $k_j$ is the number of
cases of the jth category among the k nearest neighbors. Instead of simply estimating the predicted
probability for the jth category by $k_j / k$, we apply a Laplace correction as follows:

$$ \frac{k_j + 1}{k + J} $$

where J is the number of categories in the training data set.


The effect of the Laplace correction is to shrink the probability estimates toward 1/J when the
number of nearest neighbors is small. In addition, if a query case has k nearest neighbors with the
same response value, the probability estimates are less than 1 and larger than 0, instead of 1 or 0.
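
A small Python sketch of the Laplace-corrected probabilities. The function name `laplace_probabilities` and its arguments (the list of neighbor labels and the list of all training categories) are assumptions for illustration.

```python
def laplace_probabilities(neighbor_labels, categories):
    """Return {category: (k_j + 1) / (k + J)} for the k nearest neighbors."""
    k, n_categories = len(neighbor_labels), len(categories)
    return {c: (neighbor_labels.count(c) + 1) / (k + n_categories)
            for c in categories}
```

With three neighbors that all belong to category 1 and two categories in total, the estimates are 4/5 and 1/5 rather than 1 and 0.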


Continuous response 
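Predict each case using the mean or median function.


Mean function.

$$ \hat y_n = \frac{1}{k} \sum_{m \in \text{Nearest}(n)} y_m $$


Median function. Suppose that $y_m$, $m \in \text{Nearest}(n)$, are the values of the continuous response
variable, and we arrange them from the lowest value to the highest value and
denote them as $y_{(j_1)} \le y_{(j_2)} \le \dots \le y_{(j_k)}$; then the median is

$$ \hat y_n = \begin{cases} y_{(j_{(k+1)/2})} & k \text{ is odd} \\ \dfrac{y_{(j_{k/2})} + y_{(j_{k/2+1})}}{2} & k \text{ is even} \end{cases} $$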
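
The two prediction functions can be written directly from these definitions. The function names are hypothetical, and the input is assumed to be the list of response values of the k nearest neighbors.

```python
def predict_mean(neighbor_values):
    """Average of the neighbors' response values."""
    return sum(neighbor_values) / len(neighbor_values)


def predict_median(neighbor_values):
    """Middle order statistic for odd k, average of the two middle ones for even k."""
    s = sorted(neighbor_values)
    k = len(s)
    if k % 2 == 1:
        return s[(k - 1) // 2]
    return (s[k // 2 - 1] + s[k // 2]) / 2
```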
Linear modeling algorithms


Linear models predict a continuous target based on linear relationships between the target and 


one or more predictors. 


For algorithms on enhancing model accuracy, enhancing model stability, or working with very 
large datasets, see “Ensembles Algorithms”. 


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


n            Number of distinct records in the dataset. It is an integer and $n \ge 1$.

p            Number of parameters (including parameters for dummy variables but
             excluding the intercept) in the model. It is an integer and $p \ge 0$.

$p^*$          Number of non-redundant parameters (excluding the intercept) currently in
             the model. It is an integer and $0 \le p^* \le p$.

$p^c$          Number of non-redundant parameters currently in the model. $p^c = p^* + 1$.

$p^e$          Number of effects excluding the intercept. It is an integer and $0 \le p^e \le p$.

y            $n \times 1$ target vector with elements $y_i$.

f            $n \times 1$ frequency weight vector.

g            $n \times 1$ regression weight vector.

N            Effective sample size. It is an integer and $N = \sum_{i=1}^{n} f_i$. If there is no
             frequency weight vector, N = n.

X            $n \times (p+1)$ design matrix with element $x_{ij}$. The rows represent the records
             and the columns represent the parameters.

$\varepsilon$            $n \times 1$ vector of unobserved errors.

$\beta$            $(p+1) \times 1$ vector of unknown parameters; $\beta = (\beta_0, \beta_1, \dots, \beta_p)^T$. $\beta_0$ is the
             intercept.

$\hat\beta$            $(p+1) \times 1$ vector of parameter estimates.

b            $(p+1) \times 1$ vector of standardized parameter estimates. It is the result of a
             sweep operation on matrix R. $b_0$ is the standardized estimate of the intercept
             and is equal to 0.

$\hat y$            $n \times 1$ vector of predicted target values.

$\bar X_i$          Weighted sample mean for $X_i$, i = 1, 2, ..., p.

$\bar y$            Weighted sample mean for y.

$S_{ij}$         Weighted sample covariance between $X_i$ and $X_j$, i, j = 1, 2, ..., p.

$S_{iy}$         Weighted sample covariance between $X_i$ and y.

$S_{yy}$         Weighted sample variance for y.

R            $(p+1) \times (p+1)$ weighted sample correlation matrix for X (excluding the
             intercept, if it exists) and y.

$\tilde R$            The resulting matrix after a sweep operation whose elements are $\tilde r_{ij}$.


Model


Linear regression has the form

$$ y = X\beta + \varepsilon $$

where ε follows a normal distribution with mean 0 and variance $\sigma^2 D^{-1}$, where
$D^{-1} = \mathrm{diag}(1/g_1, \dots, 1/g_n)$. The elements of ε are independent with respect to each other.


Notes:
■ X can be any combination of continuous and categorical effects.
■ Constant columns in the design matrix are not used in model building.
■ If n = 1 or the target is constant, no model is built.


Missing values 


Records with missing values are deleted listwise. 


Least squares estimation 


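The coefficients are estimated by the least squares (LS) method. First, we transform the model
by pre-multiplying by $D^{1/2}$ as follows:

$$D^{1/2} y = D^{1/2} X \beta + D^{1/2} \varepsilon$$

so that the new unobserved error $D^{1/2}\varepsilon$ follows a normal distribution $N_n(0, \sigma^2 I)$, where I is an
identity matrix and $D^{1/2} = \mathrm{diag}(\sqrt{g_1}, \ldots, \sqrt{g_n})$. Then the least squares estimates of $\beta$ can be
obtained from the following formula

$$\hat{\beta} = \arg\min_{\beta} \left(D^{1/2} y - D^{1/2} X \beta\right)^T F \left(D^{1/2} y - D^{1/2} X \beta\right)$$

where $F = \mathrm{diag}(f_1, \ldots, f_n)$. Note that

$$\left(D^{1/2} y - D^{1/2} X \beta\right)^T F \left(D^{1/2} y - D^{1/2} X \beta\right) = (y - X\beta)^T D^{1/2} F D^{1/2} (y - X\beta) = (y - X\beta)^T W (y - X\beta)$$

where $W = \mathrm{diag}(w_1, \ldots, w_n) = \mathrm{diag}(g_1 f_1, \ldots, g_n f_n)$, so the closed form solution of $\beta$ is

$$\hat{\beta} = (X^T W X)^{-} X^T W y$$

where the superscript $^{-}$ denotes a generalized inverse.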
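The closed form above can be illustrated with a minimal Python sketch. It assumes the simplest case of an intercept plus one predictor, so the normal equations are only 2 × 2 and can be solved directly; the function name `weighted_least_squares` and the sample data are illustrative and are not part of the procedure.

```python
def weighted_least_squares(x, y, f=None, g=None):
    """Fit y = b0 + b1*x by weighted least squares with w_i = g_i * f_i."""
    n = len(x)
    f = [1.0] * n if f is None else f   # frequency weights
    g = [1.0] * n if g is None else g   # regression weights
    w = [gi * fi for gi, fi in zip(g, f)]

    # Entries of X^T W X and X^T W y for the design matrix X = [1, x].
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swy = sum(wi * yi for wi, yi in zip(w, y))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))

    # Solve the 2 x 2 normal equations (X^T W X) beta = X^T W y directly.
    det = sw * swxx - swx * swx
    b0 = (swxx * swy - swx * swxy) / det
    b1 = (sw * swxy - swx * swy) / det
    return b0, b1


# A frequency weight of 2 acts like a duplicated record.
weighted = weighted_least_squares([0, 1, 2], [1, 3, 4], f=[1, 2, 1])
expanded = weighted_least_squares([0, 1, 1, 2], [1, 3, 3, 4])
```

In this sketch the two calls return the same coefficients, which reflects how the frequency weight enters W. The procedure itself does not solve the normal equations this way; it uses sweep operations on a correlation matrix.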


Linear modeling algorithms 


$\hat{\beta}$ is computed by applying sweep operations instead of the equation above. In addition, sweep
operations are applied to the transformed scale of X and y to achieve numerical stability.
Specifically, we construct the weighted sample correlation matrix R and then apply sweep operations
to it. The R matrix is constructed as follows.


First, compute weighted sample means, variances and covariances among $X_i$, $X_j$,
$i, j = 1, \ldots, p$, and y:


Weighted sample means of $X_i$ and y are

$$\bar{X}_i = \frac{1}{N}\sum_{k=1}^{n} w_k x_{ki} \quad\text{and}\quad \bar{y} = \frac{1}{N}\sum_{k=1}^{n} w_k y_k;$$


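Weighted sample covariance for $X_i$ and $X_j$ is

$$S_{ij} = \frac{1}{N-1}\sum_{k=1}^{n} w_k (x_{ki} - \bar{X}_i)(x_{kj} - \bar{X}_j);$$

Weighted sample covariance for $X_i$ and y is

$$S_{iy} = \frac{1}{N-1}\sum_{k=1}^{n} w_k (x_{ki} - \bar{X}_i)(y_k - \bar{y});$$

Weighted sample variance for y is

$$S_{yy} = \frac{1}{N-1}\sum_{k=1}^{n} w_k (y_k - \bar{y})^2.$$


Second, compute weighted sample correlations $r_{ij} = \dfrac{S_{ij}}{\sqrt{S_{ii} S_{jj}}}$, $i, j = 1, \ldots, p$ and y.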


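Then the matrix R is

$$R = \begin{bmatrix} r_{11} & r_{12} & \cdots & r_{1p} & r_{1y} \\ r_{21} & r_{22} & \cdots & r_{2p} & r_{2y} \\ \vdots & \vdots & & \vdots & \vdots \\ r_{p1} & r_{p2} & \cdots & r_{pp} & r_{py} \\ r_{y1} & r_{y2} & \cdots & r_{yp} & r_{yy} \end{bmatrix} = \begin{bmatrix} R_{11} & R_{12} \\ R_{12}^T & R_{22} \end{bmatrix}$$


If the sweep operations are repeatedly applied to each row of $R_{11}$, where $R_{11}$ contains the
predictors in the model at the current step, the result is

$$\tilde{R} = \begin{bmatrix} R_{11}^{-1} & R_{11}^{-1} R_{12} \\ -R_{12}^T R_{11}^{-1} & R_{22} - R_{12}^T R_{11}^{-1} R_{12} \end{bmatrix}$$


The last column $R_{11}^{-1} R_{12}$ contains the standardized coefficient estimates; that is, $\hat{b} = R_{11}^{-1} R_{12}$.
Then the coefficient estimates, except the intercept estimate if there is an intercept in the model,
are:

$$\hat{\beta}_j = \hat{b}_j \sqrt{\frac{S_{yy}}{S_{jj}}}$$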
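The sweep step described above can be sketched in a few lines of Python. This is a minimal illustration under one assumption: the sign convention is chosen so that sweeping the predictor rows reproduces the block form shown above (pivot row divided by the pivot, pivot column divided by the pivot and negated). The function name `sweep` and the 3 × 3 example matrix are illustrative only.

```python
def sweep(a, k):
    """Return a copy of the square matrix `a` swept on pivot row/column k."""
    n = len(a)
    d = a[k][k]
    # Every entry off the pivot row and column gets the usual rank-one update.
    out = [[a[i][j] - a[i][k] * a[k][j] / d for j in range(n)] for i in range(n)]
    for j in range(n):
        out[k][j] = a[k][j] / d      # pivot row
        out[j][k] = -a[j][k] / d     # pivot column (negated)
    out[k][k] = 1.0 / d              # pivot element
    return out


# Correlation matrix for two predictors and the target y (last row/column).
R = [[1.0, 0.5, 0.6],
     [0.5, 1.0, 0.7],
     [0.6, 0.7, 1.0]]

swept = sweep(sweep(R, 0), 1)
standardized = [swept[0][2], swept[1][2]]   # last column: standardized estimates
r_yy_tilde = swept[2][2]                    # last diagonal element after the sweeps
```

For this matrix the last column holds 1/3 and 8/15, which is what solving $R_{11}\hat{b} = R_{12}$ directly gives, and the last diagonal element is the residual proportion $1 - R_{12}^T R_{11}^{-1} R_{12}$.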
Model selection 


The following model selection methods are supported: 


■ None, in which no selection method is used and effects are force entered into the model. For
this method, the singularity tolerance is set to 1e-12 during the sweep operation.
■ Forward stepwise, which starts with no effects in the model and adds and removes effects one
step at a time until no more can be added or removed according to the stepwise criteria.
■ Best subsets, which checks “all possible” models, or at least a larger subset of the possible
models than forward stepwise, to choose the best according to the best subsets criterion.


Forward stepwise 


The basic idea of the forward stepwise method is to add effects one at a time as long as these
additions are worthwhile. After an effect has been added, all effects in the current model are checked
to see if any of them should be removed. Then the process continues until a stopping criterion
is met. The traditional criterion for effect entry and removal is based on their F statistics and
corresponding p-values, which are compared with some specified entry and removal significance
levels; however, these statistics may not actually follow an F distribution, so the results might be
questionable. Hence the following additional criteria for effect entry and removal are offered:


■ Maximum adjusted R2;
■ Minimum corrected Akaike information criterion (AICC); and
■ Minimum average squared error (ASE) over the overfit prevention data.


Candidate statistics 


Some additional notation is needed to describe the addition or removal of a continuous effect $X_j$ or
categorical effect $\{X_{j_s}\}_{s=1}^{\ell}$, where $\ell$ is the number of categories.


ℓ*            The number of non-redundant parameters of the eligible effect $X_j$ or
              $\{X_{j_s}\}_{s=1}^{\ell}$.
p^c           The number of non-redundant parameters in the current model (including
              the intercept).
p^r           The number of non-redundant parameters in the resulting model (including
              the intercept). Note that $p^r = p^c + \ell^*$ for entering an effect and
              $p^r = p^c - \ell^*$ for removing an effect.
SSe_p         The weighted residual sum of squares for the current model.
SSe_{p+ℓ*}    The weighted residual sum of squares for the resulting model after entering
              the effect.
SSe_{p-ℓ*}    The weighted residual sum of squares for the resulting model after removing
              the effect.
r_yy          The last diagonal element in the current R matrix.
r̃_yy          The last diagonal element in the resulting R matrix.


F statistics. The F statistics for entering or removing an effect from the current model are:

$$F_{enter_j} = \frac{(SSe_{p} - SSe_{p+\ell^*})/\ell^*}{SSe_{p+\ell^*}/(N - p^r)} = \frac{(r_{yy} - \tilde{r}_{yy})(N - p^r)}{\tilde{r}_{yy} \times \ell^*}$$

$$F_{remove_j} = \frac{(SSe_{p-\ell^*} - SSe_{p})/\ell^*}{SSe_{p}/(N - p^c)} = \frac{(\tilde{r}_{yy} - r_{yy})(N - p^c)}{r_{yy} \times \ell^*}$$

and their corresponding p-values are:

$$p_{enter_j} = P(F_{\ell^*, N-p^r} \ge F_{enter_j}) = 1 - P(F_{\ell^*, N-p^r} \le F_{enter_j})$$

$$p_{remove_j} = P(F_{\ell^*, N-p^c} \ge F_{remove_j}) = 1 - P(F_{\ell^*, N-p^c} \le F_{remove_j})$$


Adjusted R-squared. The adjusted R2 value for entering or removing an effect from the current
model is:

$$adj.R^2 = 1 - \frac{(N - 1)\,\tilde{r}_{yy}}{N - p^r}$$


Corrected Akaike Information Criterion (AICC). The AICC value for entering or removing an
effect from the current model is:

$$AICC = N \ln\left(\frac{(N - 1)\,\tilde{r}_{yy}\,S_{yy}}{N}\right) + \frac{2 p^r N}{N - p^r - 1}$$


Average Squared Error (ASE). The ASE value for entering or removing an effect from the current
model is:

$$ASE = \frac{1}{\sum_{t=1}^{T} w_t} \sum_{t=1}^{T} w_t (y_t - \hat{y}_t)^2$$

where $\hat{y}_t = x_t \hat{\beta}$ are the predicted values of $y_t$ and T is the number of distinct testing cases in
the overfit prevention set.


The Selection Process 


There are slight variations in the selection process, depending upon the model selection criterion: 


■ The F statistic criterion is to select an effect for entry (removal) with the minimum (maximum)
p-value and continue doing so until the p-values of all candidates for entry (removal) are equal
to or greater than (less than) a specified significance level.
■ The other three criteria are to compare the statistic (adjusted R2, AICC or ASE) of the
resulting model after entering (removing) an effect with that of the current model. Selection
stops at a local optimal value (a maximum for the adjusted R2 criterion and a minimum
for the AICC and ASE).


The following additional definitions are needed for the selection process: 


FLAG          A $p^e \times 1$ index vector which records the status of each effect. $FLAG_i = 1$
              means the effect i is in the current model, $FLAG_i = 0$ means it is not.
              $|\{i \mid FLAG_i = 1\}|$ denotes the number of effects with FLAG = 1.
MAXSTEP       The maximum number of iteration steps. The default value is $3 \times p^e$.
MAXEFFECT     The maximum number of effects (excluding the intercept, if it exists). The default
              value is $p^e$.
P_in          The significance level for effect entry when the F statistic criterion is used.
              The default is 0.05.
P_out         The significance level for effect removal when the F statistic criterion is
              used. The default is 0.1.
ΔF            The F statistic change. It is $F_{enter_j}$ or $F_{remove_j}$ for entering or removing
              an effect $X_j$ (here $X_j$ could represent a continuous or categorical effect, for simpler
              notation).
P_ΔF          The corresponding p-value for ΔF.
MSC_current   The adjusted R2, AICC, or ASE value for the current model.


1. Set $\{FLAG_i\}_{i=1}^{p^e} = 0$ and iter = 0. The initial model is $\hat{y} = \bar{y}$. If the adjusted R2, AICC, or ASE
criterion is used, compute the statistic for the initial model and denote it as $MSC_{current}$.


2. If $\{i \mid FLAG_i = 0\} \ne \emptyset$, iter < MAXSTEP and $|\{i \mid FLAG_i = 1\}|$ < MAXEFFECT, go to the
next step; otherwise stop and output the current model.


3. Based on the current model, for every effect j eligible for entry (see Condition below),
If FC (the F statistic criterion) is used, compute $F_{enter_j}$ and $p_{enter_j}$;
If MSC (the adjusted R2, AICC, or ASE criterion) is used, compute $MSC_j$.


4. If FC is used, choose the effect $X_{j^*}$, $j^* = \arg\min_j \{p_{enter_j}\}$, and if $p_{enter_{j^*}} < P_{in}$, enter $X_{j^*}$ to the
current model.
If MSC is used, choose the effect $X_{j^*}$, $j^* = \arg\min_j \{MSC_j\}$, and if $MSC_{j^*} < MSC_{current}$,
enter $X_{j^*}$ to the current model. (For the adjusted R2 criterion, replace min with max and reverse
the inequality.)
If the inequality is not satisfied, stop and output the current model.


5. If the model with the new effect is the same as any previously obtained model, stop and output the
current model; otherwise update the current model by doing the sweep operation on corresponding
row(s) and column(s) associated with $X_{j^*}$ in the current R matrix. Set $FLAG_{j^*} = 1$ and iter =
iter + 1.
If FC is used, let $\Delta F = F_{enter_{j^*}}$ and $P_{\Delta F} = p_{enter_{j^*}}$;
If MSC is used, let $MSC_{current} = MSC_{j^*}$.


6. For every effect k in the current model; that is, $FLAG_k = 1$,
If FC is used, compute $F_{remove_k}$ and $p_{remove_k}$;
If MSC is used, compute $MSC_k$.


7. If FC is used, choose the effect $X_{k^*}$, $k^* = \arg\max_k \{p_{remove_k}\}$, and if $p_{remove_{k^*}} > P_{out}$, remove
$X_{k^*}$ from the current model.
If MSC is used, choose the effect $X_{k^*}$, $k^* = \arg\min_k \{MSC_k\}$, and if $MSC_{k^*} < MSC_{current}$,
remove $X_{k^*}$ from the current model. (For the adjusted R2 criterion, replace min with max and
reverse the inequality.)
If the inequality is met, go to the next step; otherwise go back to step 2.


8. If the model with the effect removed is the same as any previously obtained model, stop and
output the current model; otherwise update the current model by doing the sweep operation
on corresponding row(s) and column(s) associated with $X_{k^*}$ in the current R matrix. Set
$FLAG_{k^*} = 0$ and iter = iter + 1.
If FC is used, let $\Delta F = F_{remove_{k^*}}$ and $P_{\Delta F} = p_{remove_{k^*}}$;
If MSC is used, let $MSC_{current} = MSC_{k^*}$. Then go back to step 6.
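The control flow of steps 1 to 8 can be sketched in Python for the criteria that are minimized (AICC or ASE). This is a simplified illustration, not the procedure's implementation: the scoring function `msc` is a hypothetical callable supplied by the caller, models are plain sets of effect names rather than swept R matrices, and the eligibility (tolerance) conditions and the F statistic criterion are left out.

```python
def forward_stepwise(effects, msc, max_step=None, max_effect=None):
    """Forward stepwise search; msc(model) returns a value to minimize."""
    effects = list(effects)
    max_step = 3 * len(effects) if max_step is None else max_step
    max_effect = len(effects) if max_effect is None else max_effect

    current = frozenset()            # step 1: no effects in the model
    best = msc(current)              # criterion value of the initial model
    seen = {current}                 # every model obtained so far
    steps = 0

    # Step 2: continue while effects remain and the limits are not reached.
    while (len(current) < len(effects) and steps < max_step
           and len(current) < max_effect):
        # Steps 3-4: candidate whose entry gives the smallest criterion value.
        outside = [e for e in effects if e not in current]
        enter = min(outside, key=lambda e: msc(current | {e}))
        trial = current | {enter}
        if not msc(trial) < best or trial in seen:
            break                    # no improvement, or a repeated model (step 5)
        current, best = trial, msc(trial)
        seen.add(current)
        steps += 1

        # Steps 6-8: remove effects for as long as removal improves the criterion.
        while current:
            drop = min(current, key=lambda e: msc(current - {e}))
            trial = current - {drop}
            if not msc(trial) < best:
                break                # back to step 2
            if trial in seen:
                return sorted(current), best
            current, best = trial, msc(trial)
            seen.add(current)
            steps += 1
    return sorted(current), best


# Toy criterion: effects "a" and "b" help, any other effect hurts.
useful = {"a", "b"}

def toy_msc(model):
    return 10 - 3 * len(model & useful) + len(model - useful)

selected, value = forward_stepwise(["a", "b", "c"], toy_msc)
```

With the toy criterion the search enters "a", then "b", and stops because adding "c" would raise the criterion. For the adjusted R2 criterion the same loop applies with the comparisons reversed.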


Condition. In order for effect j to be eligible for entry into the model, the following conditions
must be met:

For continuous effect $X_j$, $r_{jj} \ge t$ (t is the singularity tolerance with a value of 1e-4).

For categorical effect $\{X_{j_s}\}_{s=1}^{\ell}$, $\max(r_{j_1 j_1}, \ldots, r_{j_\ell j_\ell}) \ge t$,

where t is the singularity tolerance, and $r_{jj}$ and $r_{j_s j_s}$, $s = 1, \ldots, \ell$, are diagonal elements in the
current R matrix (before entering).

For each continuous effect $X_k$ that is currently in the model, $\tilde{r}_{kk}\, t \le 1$.

For each categorical effect $\{X_{k_s}\}_{s=1}^{\ell'}$ with $\ell'$ levels that is currently in the model,
$\max(\tilde{r}_{k_1 k_1}, \ldots, \tilde{r}_{k_{\ell'} k_{\ell'}})\, t \le 1$,

where $\tilde{r}_{kk}$ and $\tilde{r}_{k_s k_s}$, $s = 1, \ldots, \ell'$, are diagonal elements in the resulting R matrix; that is, the
results after doing the sweep operation on corresponding row(s) and column(s) associated with $X_j$
or $\{X_{j_s}\}_{s=1}^{\ell}$ in the current R matrix. The above condition is imposed so that entry of the effect
does not reduce the tolerance of other effects already in the model to unacceptable levels.


Best subsets 


Stepwise methods search fewer combinations of sub-models and rarely select the best one, so
another option is to check all possible models and select the “best” based upon some criterion.
The available criteria are the maximum adjusted R2, minimum AICC, and minimum ASE over
the overfit prevention set.


Since there are $p^e$ free effects, we do an exhaustive search over $2^{p^e}$ models, which include the
intercept-only model ($\hat{y} = \bar{y}$). Because the number of calculations increases exponentially with
$p^e$, it is important to have an efficient algorithm for carrying out the necessary computations.
However, if $p^e$ is too large, it may not be practical to check all of the possible models.


We divide the problem into 2 tiers in terms of the number of effects:
■ when $p^e \le 20$, we search all possible subsets
■ when $p^e > 20$, we apply a hybrid method which combines the forward stepwise method and
the all possible subsets method.


Searching All Possible Subsets


An efficient method that minimizes the number of sweep operations on the R matrix (Schatzoff
1968) is applied to traverse all the models. It is outlined as follows:


Each sweep step(s) on an effect results in a model. So all $2^{p^e}$ models (counting the
intercept-only starting model) can be obtained through a sequence of exactly $2^{p^e} - 1$
sweeps on effects. Assume that all possible models on $p^e - 1$ effects can be obtained in a
sequence $S_{p^e-1}$ of exactly $2^{p^e-1} - 1$ sweeps on the first $p^e - 1$ pivotal effects. Sweeping
on the last effect then produces a new model which adds the last effect to the model
produced by the sequence $S_{p^e-1}$, and repeating the sequence $S_{p^e-1}$ produces another
$2^{p^e-1} - 1$ distinct models (including the last effect). It is a recursive algorithm for
constructing the sequence; that is,

$$S_k = (S_{k-1}, k, S_{k-1}) = (S_{k-2}, k-1, S_{k-2}, k, S_{k-2}, k-1, S_{k-2})$$

and so on.


The sequence of models produced is demonstrated in the following table:


k      S_k                          Sequence of models produced
0      0                            Only intercept
1      1                            (1)
2      121                          (1),(12),(2)
3      1213121                      (1),(12),(2),(23),(123),(13),(3)
4      121312141213121              (1),(12),(2),(23),(123),(13),(3),(34),(134),(1234),(234),(24),(124),(14),(4)
p^e    S_{p^e-1}, p^e, S_{p^e-1}    All $2^{p^e}$ models including the intercept model.


The second column indicates the indexes of effects which are pivoted on. Each parenthesis in the 
third column represents a regression model. The numbers in the parentheses indicate the effects 
which are included in that model. 
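The recursive construction of the pivot sequence can be checked with a short Python sketch. It only tracks which effects are in the model after each sweep (sweeping an effect that is already in the model takes it out again); it does not perform the sweeps on R. The function names are illustrative.

```python
def sweep_sequence(k):
    """Pivot sequence S_k = (S_{k-1}, k, S_{k-1}), with S_0 empty."""
    if k == 0:
        return []
    previous = sweep_sequence(k - 1)
    return previous + [k] + previous


def models_visited(k):
    """Model (as a tuple of effect indexes) obtained after each sweep in S_k."""
    model = set()
    visited = []
    for pivot in sweep_sequence(k):
        model ^= {pivot}             # a sweep toggles the effect in or out
        visited.append(tuple(sorted(model)))
    return visited


print(sweep_sequence(3))    # [1, 2, 1, 3, 1, 2, 1]
print(models_visited(3))    # [(1,), (1, 2), (2,), (2, 3), (1, 2, 3), (1, 3), (3,)]
```

The printed values match the rows for k = 3 in the table, and for any k the sequence has 2^k − 1 pivots that visit 2^k − 1 distinct models besides the intercept-only model.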


Hybrid Method 


If $p^e > 20$, we apply a hybrid method by combining the forward stepwise method with the all
possible subsets method as follows:


Select the effects using the forward stepwise method with the same criterion chosen for best
subsets. Say that $p^s$ is the number of effects chosen by the forward stepwise method.


Apply one of the following approaches, depending on the value of $p^s$:


■ If $p^s \le 20$, do an exhaustive search of all possible subsets on these selected effects, as
described above.
■ If $20 < p^s \le 40$, select $p^s - 20$ effects based on the p-values of type III sum of squares tests from
all $p^s$ effects (see ANOVA in “Model evaluation”) and enter them into the model, then do an
exhaustive search of the remaining 20 effects via the method described above.
■ If $40 < p^s$, do nothing and assume the best model is the one with these $p^s$ effects (with a
warning message that the selected model is based on the forward stepwise method).


Model evaluation


The following output statistics are available. 


ANOVA 


Weighted total sum of squares

$$SS_t = \sum_{i=1}^{n} w_i (y_i - \bar{y})^2 = (N - 1) S_{yy}$$

with d.f. $= df_t = N - 1$, where d.f. means degrees of freedom. It is called “SS (sum of squares) for Corrected Total”.


Weighted residual sum of squares

$$SS_e = \sum_{i=1}^{n} w_i (y_i - \hat{y}_i)^2 = \tilde{r}_{yy} (N - 1) S_{yy}$$

with d.f. $= df_e = N - p^c$. It is also called “SS for Error”.


Weighted regression sum of squares

$$SS_r = \sum_{i=1}^{n} w_i (\hat{y}_i - \bar{y})^2 = (1 - \tilde{r}_{yy})(N - 1) S_{yy} = SS_t - SS_e$$

with d.f. $= df_r = p^*$. It is called “SS for Corrected Model” if there is an intercept.


Regression mean square error

$$SS_r / df_r$$


Residual mean square error

$$SS_e / df_e$$


F statistic for corrected model

$$F = \frac{SS_r/df_r}{SS_e/df_e} = \frac{SS_r \cdot df_e}{SS_e \cdot df_r}$$

which follows an F distribution with degrees of freedom $df_r$ and $df_e$, and the corresponding
p-value can be calculated accordingly.


Type III sum of squares for each effect


To compute type III SS for the effect j, $j = 1, \ldots, p^e$, the type III test matrix $L_j$
needs to be constructed first. Construction of $L_j$ is based on the generating matrix
$H_\omega = (X^T D X)^{-} X^T D X$, where $D = \mathrm{diag}(g_1, \ldots, g_n)$, such that $L_j\beta$ is estimable. It involves
parameters only for the given effect and the effects containing the given effect. For type III
analysis, $L_j$ doesn’t depend on the order of effects specified in the model. If such a matrix cannot
be constructed, the effect is not testable. For each effect j, the type III SS is calculated as follows

$$S_j = \hat{\beta}^T L_j^T \left(L_j G L_j^T\right)^{-1} L_j \hat{\beta}$$

where $G = (X^T W X)^{-}$.


F statistic for each effect


The SS for the effect j is also used to compute the F statistic for the hypothesis test $H_0: L_j\beta = 0$
as follows:

$$F_j = \frac{S_j / r_j}{SS_e / df_e}$$

where $r_j$ is the full row rank of $L_j$. It follows an F distribution with degrees of freedom $r_j$ and
$df_e$, and the p-values can be calculated accordingly.


Model summary 


Adjusted R square

$$adj.R^2 = 1 - \frac{SS_e/df_e}{SS_t/df_t} = 1 - (1 - R^2)\frac{df_t}{df_e} = 1 - \frac{df_t\,\tilde{r}_{yy}}{df_e}$$

where

$$R^2 = 1 - \frac{SS_e}{SS_t} = 1 - \tilde{r}_{yy}$$


Model information criteria


Corrected Akaike information criterion (AICC)

$$AICC = N \ln\left(\frac{SS_e}{N}\right) + \frac{2 p^c N}{N - p^c - 1}$$
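As a quick numerical illustration of the two summary formulas above, the following Python sketch evaluates them from the sums of squares. The function and argument names (`sse`, `sst`, `n_eff` for N, `p_c` for the number of non-redundant parameters including the intercept) are illustrative choices.

```python
import math


def adjusted_r2(sse, sst, n_eff, p_c):
    """1 - (SSe/dfe) / (SSt/dft) with dfe = N - p_c and dft = N - 1."""
    return 1.0 - (sse / (n_eff - p_c)) / (sst / (n_eff - 1))


def aicc(sse, n_eff, p_c):
    """N ln(SSe/N) + 2 p_c N / (N - p_c - 1)."""
    return n_eff * math.log(sse / n_eff) + 2.0 * p_c * n_eff / (n_eff - p_c - 1)


# Example: SSe = 20, SSt = 100, N = 50, three parameters including the intercept.
adj = adjusted_r2(20.0, 100.0, 50, 3)
crit = aicc(20.0, 50, 3)
```

Here R2 is 0.8, so the adjusted value is 1 − 0.2 × 49/47, slightly below 0.8. The AICC penalty term grows quickly as the parameter count approaches N − 1, where the denominator reaches zero.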


Coefficients and statistical inference 


After the model selection process, we can get the coefficients and related statistics from the swept
correlation matrix. The following statistics are computed based on the R matrix.


Unstandardized coefficient estimates

$$\hat{\beta}_j = \hat{b}_j \sqrt{\frac{S_{yy}}{S_{jj}}}$$

for $j = 1, \ldots, p$.


Standard errors of regression coefficients


The standard error of $\hat{\beta}_j$ is

$$\hat{\sigma}_{\hat{\beta}_j} = \sqrt{\frac{\tilde{r}_{jj}\,\tilde{r}_{yy}\,S_{yy}}{S_{jj}\,df_e}}$$


Intercept estimation


The intercept is estimated by all other parameters in the model as

$$\hat{\beta}_0 = \bar{y} - \sum_{j=1}^{p} \hat{\beta}_j \bar{X}_j$$


The standard error of $\hat{\beta}_0$ is estimated by

$$\hat{\sigma}_{\hat{\beta}_0} = \sqrt{\hat{\sigma}^2_{\hat{\beta}_0}}$$

where

$$\hat{\sigma}^2_{\hat{\beta}_0} = \frac{(N - 1)\,\tilde{r}_{yy}\,S_{yy}}{N (N - p^* - 1)} + \sum_{j=1}^{p} \bar{X}_j^2\,\hat{\sigma}^2_{\hat{\beta}_j} + 2\sum_{k=1}^{p-1}\sum_{j=k+1}^{p} \bar{X}_k \bar{X}_j\,\mathrm{cov}(\hat{\beta}_k, \hat{\beta}_j)$$

and $\mathrm{cov}(\hat{\beta}_k, \hat{\beta}_j)$ is the
kth row and jth column element in the parameter estimates covariance matrix.


t statistics for regression coefficients

$$t_j = \frac{\hat{\beta}_j}{\hat{\sigma}_{\hat{\beta}_j}} = \hat{b}_j \sqrt{\frac{df_e}{\tilde{r}_{jj}\,\tilde{r}_{yy}}}$$

for $j = 1, \ldots, p^*$, with degrees of freedom $df_e$, and the p-value can be calculated accordingly.


100(1−α)% confidence intervals

$$\hat{\beta}_j \pm \hat{\sigma}_{\hat{\beta}_j} \times t_{\alpha/2,\, df_e}$$


Note: For redundant parameters, the coefficient estimates are set to zero and standard errors, t
statistics, and confidence intervals are set to missing values.


Scoring 


Predicted values

$$\hat{y}_i = \sum_{j=0}^{p} x_{ij}\,\hat{\beta}_j$$

Diagnostics


The following values are computed to produce various diagnostic charts and tables.


Residuals

$$e_k = y_k - \hat{y}_k$$


Studentized residuals


This is the ratio of the residual to its standard error.

$$SRES_k = \frac{e_k}{s\sqrt{(1 - h_k)/g_k}}$$

where s is the square root of the mean square error; that is, $s = \sqrt{SS_e/df_e}$, and $h_k$ is the leverage
value for the kth case (see below).


Cook’s distance 


Crh 


COOK, = (1 = hy)! 


where the “leverage”

$$h_k = g_k\, x_k G x_k^T$$

is the kth diagonal element of the hat matrix

$$H = W^{1/2} X (X^T W X)^{-} X^T W^{1/2} = W^{1/2} X G X^T W^{1/2}$$


A record with Cook’s distance larger than ~ a is considered influential (Fox, 1997).


Predictor importance


We use the leave-one-out method to compute the predictor importance, based on the residual sum
of squares (SSe) by removing one predictor at a time from the final full model.


If the final full model contains p predictors, $X_1, X_2, \ldots, X_p$, then the predictor importance can be
calculated as follows:


1. i = 1.

2. If i > p, go to step 5.

3. Do a sweep operation on the corresponding row(s) and column(s) associated with $X_i$ in the
R matrix of the full final model.

4. Get the last diagonal element in the current R and denote it $\tilde{r}_{yy}$. Then the predictor importance of
$X_i$ is $VI_i = (\tilde{r}_{yy} - r_{yy})(N - 1) S_{yy}$. Let i = i + 1, and go to step 2.

5. Compute the normalized predictor importance of $X_i$:

$$NormVI_i = \frac{VI_i}{\sum_{j=1}^{p} VI_j}$$


LOGISTIC REGRESSION Algorithms


Logistic regression regresses a dichotomous dependent (target) variable on a set of independent 
(predictor) variables. Several methods are implemented for selecting the independent variables. 


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


n      The number of observed cases
p      The number of parameters
y      n × 1 vector with element $y_i$, the observed value of the ith case of the
       dichotomous dependent variable
X      n × p matrix with element $x_{ij}$, the observed value of the ith case of the
       jth parameter
β      p × 1 vector with element $\beta_j$, the coefficient for the jth parameter
w      n × 1 vector with element $w_i$, the weight for the ith case
l      Likelihood function
L      Log-likelihood function
I      Information matrix


Model 


The linear logistic model assumes a dichotomous dependent variable Y with probability $\pi$, where
for the ith case,

$$\pi_i = \frac{\exp(\eta_i)}{1 + \exp(\eta_i)}$$

or

$$\ln\left(\frac{\pi_i}{1 - \pi_i}\right) = \eta_i = X_i'\beta$$


Hence, the likelihood function l for n observations $y_1, \ldots, y_n$, with probabilities $\pi_1, \ldots, \pi_n$ and
case weights $w_1, \ldots, w_n$, can be written as

$$l = \prod_{i=1}^{n} \pi_i^{w_i y_i} (1 - \pi_i)^{w_i (1 - y_i)}$$


It follows that the logarithm of l is

$$L = \ln(l) = \sum_{i=1}^{n} \left(w_i y_i \ln(\pi_i) + w_i (1 - y_i) \ln(1 - \pi_i)\right)$$

and the derivative of L with respect to $\beta_j$ is

$$L^*_{\beta_j} = \frac{\partial L}{\partial \beta_j} = \sum_{i=1}^{n} w_i (y_i - \pi_i) x_{ij}$$


Maximum Likelihood Estimates (MLE)


The maximum likelihood estimates for $\beta$ satisfy the following equations

$$\sum_{i=1}^{n} w_i (y_i - \hat{\pi}_i) x_{ij} = 0, \text{ for the jth parameter}$$

where $x_{i0} = 1$ for $i = 1, \ldots, n$.


Note the following:


1. A Newton-Raphson type algorithm is used to obtain the MLEs. Convergence can be based on
■ Absolute difference for the parameter estimates between the iterations
■ Percent difference in the log-likelihood function between successive iterations
■ Maximum number of iterations specified


2. During the iterations, if $\hat{\pi}_i (1 - \hat{\pi}_i)$ is smaller than $10^{-8}$ for all cases, the log-likelihood function
is very close to zero. In this situation, iteration stops and the message “All predicted values
are either 1 or 0” is issued.


After the maximum likelihood estimates $\hat{\beta}$ are obtained, the asymptotic covariance matrix is
estimated by $I^{-1}$, the inverse of the information matrix I, where

$$I = -\left[E\left(\frac{\partial^2 L}{\partial \beta_i\, \partial \beta_j}\right)\right] = X' W \hat{V} X,$$

$$\hat{V} = \mathrm{Diag}\{\hat{\pi}_1 (1 - \hat{\pi}_1), \ldots, \hat{\pi}_n (1 - \hat{\pi}_n)\},$$

$$W = \mathrm{Diag}\{w_1, \ldots, w_n\},$$

$$\hat{\pi}_i = \frac{\exp(\hat{\eta}_i)}{1 + \exp(\hat{\eta}_i)},$$

and

$$\hat{\eta}_i = X_i'\hat{\beta}$$
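The estimation scheme can be illustrated with a compact Newton-Raphson sketch in pure Python. It is a simplified illustration under several assumptions: convergence is judged only by the absolute change in the estimates, the check on $\hat{\pi}_i(1 - \hat{\pi}_i)$ described in note 2 is omitted, and the helper names (`solve`, `log_likelihood`, `fit_logistic`) are illustrative.

```python
import math


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(n):
            if r != c:
                factor = m[r][c] / m[c][c]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]


def log_likelihood(beta, X, y, w):
    """L = sum_i w_i [y_i ln(pi_i) + (1 - y_i) ln(1 - pi_i)]."""
    total = 0.0
    for xi, yi, wi in zip(X, y, w):
        pi = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, xi))))
        total += wi * (yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi))
    return total


def fit_logistic(X, y, w, max_iter=20, tol=1e-8):
    """Return (beta, information matrix) from Newton-Raphson iterations."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(max_iter):
        grad = [0.0] * p                       # sum_i w_i (y_i - pi_i) x_ij
        info = [[0.0] * p for _ in range(p)]   # X' W V X
        for xi, yi, wi in zip(X, y, w):
            pi = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, xi))))
            for j in range(p):
                grad[j] += wi * (yi - pi) * xi[j]
                for k in range(p):
                    info[j][k] += wi * pi * (1.0 - pi) * xi[j] * xi[k]
        step = solve(info, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta, info


# Constant-only model: the estimate should equal ln(pi0 / (1 - pi0)).
beta_const, info_const = fit_logistic([[1.0]] * 4, [1, 0, 1, 1], [1, 1, 2, 1])
```

In the constant-only example the weighted event proportion is 4/5, so the iteration settles at ln 4, and the information is close to W × 0.8 × 0.2. The inverse of the returned information matrix gives the asymptotic covariance estimate described above.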


Stepwise Variable Selection 


Several methods are available for selecting independent variables. With the forced entry method, 
any variable in the variable list is entered into the model. There are two stepwise methods: 
forward and backward. The stepwise methods can use either the Wald statistic, the likelihood 
ratio, or a conditional algorithm for variable removal. For both stepwise methods, the score 
statistic is used to select variables for entry into the model. 


Forward Stepwise (FSTEP) 


1. If FSTEP is the first method requested, estimate the parameter and likelihood function for the
initial model. Otherwise, the final model from the previous method is the initial model for FSTEP.
Obtain the necessary information: MLEs of the parameters for the current model, predicted
probability, likelihood function for the current model, and so on.


2. Based on the MLEs of the current model, calculate the score statistic for every variable eligible for
inclusion and find its significance.


3. Choose the variable with the smallest significance. If that significance is less than the probability
for a variable to enter, then go to step 4; otherwise, stop FSTEP.


4. Update the current model by adding a new variable. If this results in a model which has already
been evaluated, stop FSTEP.


5. Calculate the LR or Wald statistic or conditional statistic for each variable in the current model.
Then calculate its corresponding significance.


6. Choose the variable with the largest significance. If that significance is less than the probability
for variable removal, then go back to step 2; otherwise, if the current model with the variable
deleted is the same as a previous model, stop FSTEP; otherwise, go to the next step.


7. Modify the current model by removing the variable with the largest significance from the previous
model. Estimate the parameters for the modified model and go back to step 5.


Backward Stepwise (BSTEP) 


1. Estimate the parameters for the full model which includes the final model from the previous method
and all eligible variables. Only variables listed on the BSTEP variable list are eligible for entry
and removal. Let the current model be the full model.


2. Based on the MLEs of the current model, calculate the LR or Wald statistic or conditional statistic
for every variable in the model and find its significance.


3. Choose the variable with the largest significance. If that significance is less than the probability for
a variable removal, then go to step 5; otherwise, if the current model without the variable with the
largest significance is the same as the previous model, stop BSTEP; otherwise, go to the next step.


4. Modify the current model by removing the variable with the largest significance from the model.
Estimate the parameters for the modified model and go back to step 2.


5. Check whether any eligible variable is not in the model. If there is none, stop BSTEP; otherwise,
go to the next step.


6. Based on the MLEs of the current model, calculate the score statistic for every variable not in
the model and find its significance.


7. Choose the variable with the smallest significance. If that significance is less than the probability
for variable entry, then go to the next step; otherwise, stop BSTEP.


8. Add the variable with the smallest significance to the current model. If the model is not the
same as any previous models, estimate the parameters for the new model and go back to step
2; otherwise, stop BSTEP.


Stepwise Statistics


The statistics used in the stepwise variable selection methods are defined as follows. 


Score Statistic 


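The score statistic is calculated for each variable not in the model to determine whether the
variable should enter the model. Assume that there are $r_1$ variables, namely, $\alpha_1, \ldots, \alpha_{r_1}$, in the
model and $r_2$ variables, $\gamma_1, \ldots, \gamma_{r_2}$, not in the model. The score statistic for $\gamma_i$ is defined as

$$S_i = \left(L^*_{\gamma_i}\right)^2 B_{22,i}$$

if $\gamma_i$ is not a categorical variable. If $\gamma_i$ is a categorical variable with m categories, it is converted to
a (m − 1)-dimension dummy vector. Denote these new m − 1 variables as $\tilde{\gamma}_i, \ldots, \tilde{\gamma}_{i+m-2}$. The
score statistic for $\gamma_i$ is then

$$S_i = \left(L^*_{\tilde{\gamma}}\right)' B_{22,i}\, L^*_{\tilde{\gamma}}$$

where $\left(L^*_{\tilde{\gamma}}\right)' = \left(L^*_{\tilde{\gamma}_i}, \ldots, L^*_{\tilde{\gamma}_{i+m-2}}\right)$ and the (m − 1) × (m − 1) matrix $B_{22,i}$ is

$$B_{22,i} = \left(A_{22,i} - A_{12,i}'\, A_{11}^{-1}\, A_{12,i}\right)^{-1}$$

with

$$A_{11} = \alpha' \hat{V} \alpha$$
$$A_{12,i} = \alpha' \hat{V} \tilde{\gamma}_i$$
$$A_{22,i} = \tilde{\gamma}_i' \hat{V} \tilde{\gamma}_i$$

in which $\alpha$ is the design matrix for variables $\alpha_1, \ldots, \alpha_{r_1}$ and $\tilde{\gamma}_i$ is the design matrix for dummy
variables $\tilde{\gamma}_i, \ldots, \tilde{\gamma}_{i+m-2}$. Note that $\alpha$ contains a column of ones unless the constant term
is excluded from $\eta$. Based on the MLEs for the parameters in the model, V is estimated by
$\hat{V} = \mathrm{Diag}\{\hat{\pi}_1 (1 - \hat{\pi}_1), \ldots, \hat{\pi}_n (1 - \hat{\pi}_n)\}$. The asymptotic distribution of the score statistic is a
chi-square with degrees of freedom equal to the number of variables involved.


Note the following:


1. If the model is through the origin and there are no variables in the model, $B_{22,i}$ is defined by
$A_{22,i}^{-1}$, and $\hat{V}$ is equal to $\frac{1}{4} I_n$.

2. If $B_{22,i}$ is not positive definite, the score statistic and residual chi-square statistic are set to be zero.
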
Wald Statistic


The Wald statistic is calculated for the variables in the model to determine whether a variable
should be removed. If the ith variable is not categorical, the Wald statistic is defined by

$$Wald_i = \frac{\hat{\beta}_i^2}{\hat{\sigma}^2_{\hat{\beta}_i}}$$

If it is a categorical variable, the Wald statistic is computed as follows:


Let $\hat{\beta}_i$ be the vector of maximum likelihood estimates associated with the m − 1 dummy variables,
and C the asymptotic covariance matrix for $\hat{\beta}_i$. The Wald statistic is

$$Wald_i = \hat{\beta}_i'\, C^{-1} \hat{\beta}_i$$


The asymptotic distribution of the Wald statistic is chi-square with degrees of freedom equal to
the number of parameters estimated.


Likelihood Ratio (LR) Statistic 


The LR statistic is defined as two times the log of the ratio of the likelihood functions of two
models evaluated at their MLEs. The LR statistic is used to determine if a variable should
be removed from the model. Assume that there are $r_1$ variables in the current model, which is
referred to as a full model. Based on the MLEs of the full model, l(full) is calculated. For each of
the variables removed from the full model one at a time, MLEs are computed and the likelihood
function l(reduced) is calculated. The LR statistic is then defined as

$$LR = -2 \ln\left(\frac{l(\text{reduced})}{l(\text{full})}\right) = -2\left(L(\text{reduced}) - L(\text{full})\right)$$


LR is asymptotically chi-square distributed with degrees of freedom equal to the difference
between the numbers of parameters estimated in the two models.


Conditional Statistic 


The conditional statistic is also computed for every variable in the model. The formula for the
conditional statistic is the same as the LR statistic except that the parameter estimates for each
reduced model are conditional estimates, not MLEs. The conditional estimates are defined as
follows. Let $\hat{\beta} = (\hat{\beta}_1, \ldots, \hat{\beta}_{r_1})$ be the MLE for the $r_1$ variables in the model and C be the
asymptotic covariance matrix for $\hat{\beta}$. If variable $x_i$ is removed from the model, the conditional
estimate for the parameters left in the model given $\hat{\beta}$ is

$$\tilde{\beta}_{(i)} = \hat{\beta}_{(i)} - c_{12}^{(i)} \left(c_{22}^{(i)}\right)^{-1} \hat{\beta}_i$$

where $\hat{\beta}_i$ is the MLE for the parameter(s) associated with $x_i$ and $\hat{\beta}_{(i)}$ is $\hat{\beta}$ with $\hat{\beta}_i$ removed, $c_{12}^{(i)}$ is
the covariance between $\hat{\beta}_{(i)}$ and $\hat{\beta}_i$, and $c_{22}^{(i)}$ is the covariance of $\hat{\beta}_i$. Then the conditional statistic
is computed by

$$-2\left(L(\tilde{\beta}_{(i)}) - L(\text{full})\right)$$

where $L(\tilde{\beta}_{(i)})$ is the log-likelihood function evaluated at $\tilde{\beta}_{(i)}$.


Statistics 


The following output statistics are available.


Initial Model Information


If $\beta_0$ is not included in the model, the predicted probability is estimated to be 0.5 for all cases and
the log-likelihood function L(0) is

$$L(0) = W \ln(0.5) = -0.6931472\, W$$

with $W = \sum_{i=1}^{n} w_i$. If $\beta_0$ is included in the model, the predicted probability is estimated as

$$\hat{\pi}_0 = \frac{\sum_{i=1}^{n} w_i y_i}{W}$$

and $\beta_0$ is estimated by

$$\hat{\beta}_0 = \ln\left(\frac{\hat{\pi}_0}{1 - \hat{\pi}_0}\right)$$

with asymptotic standard error estimated by

$$\hat{\sigma}_{\hat{\beta}_0} = \frac{1}{\sqrt{W \hat{\pi}_0 (1 - \hat{\pi}_0)}}$$


The log-likelihood function is

$$L(0) = W \left[\hat{\pi}_0 \ln\left(\frac{\hat{\pi}_0}{1 - \hat{\pi}_0}\right) + \ln(1 - \hat{\pi}_0)\right]$$
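The initial-model quantities are simple enough to verify numerically. The sketch below is an illustration with hypothetical names (`initial_model`, the dictionary keys) that computes them for both cases and can be compared against a direct evaluation of the weighted log-likelihood.

```python
import math


def initial_model(y, w, constant=True):
    """Initial-model statistics for a dichotomous y (0/1) with case weights w."""
    W = sum(w)
    if not constant:
        # Without a constant every predicted probability is 0.5.
        return {"pi0": 0.5, "loglik": W * math.log(0.5)}
    pi0 = sum(wi * yi for wi, yi in zip(w, y)) / W
    return {
        "pi0": pi0,
        "beta0": math.log(pi0 / (1.0 - pi0)),
        "se_beta0": 1.0 / math.sqrt(W * pi0 * (1.0 - pi0)),
        "loglik": W * (pi0 * math.log(pi0 / (1.0 - pi0)) + math.log(1.0 - pi0)),
    }


y = [1, 0, 1, 1]
w = [1, 1, 2, 1]
with_constant = initial_model(y, w)
without_constant = initial_model(y, w, constant=False)
```

For these data the weighted event proportion is 0.8, so the constant is ln 4. The closed-form log-likelihood agrees with summing $w_i[y_i \ln \hat{\pi}_0 + (1 - y_i)\ln(1 - \hat{\pi}_0)]$ over the cases.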


Model Information 


The following statistics are computed if a stepwise method is specified. 


-2 Log-Likelihood

$$-2 \sum_{i=1}^{n} \left(w_i y_i \ln(\hat{\pi}_i) + w_i (1 - y_i) \ln(1 - \hat{\pi}_i)\right)$$

Model Chi-Square 
2(log-likelihood function for current model — log-likelihood function for initial model) 


The initial model contains a constant if it is in the model; otherwise, the model has no terms. 

The degrees of freedom for the model chi-square statistic is equal to the difference between the 
numbers of parameters estimated in each of the two models. If the degrees of freedom is zero, the 
model chi-square is not computed. 


Block Chi-Square 


2(log-likelihood function for current model − log-likelihood function for the final model from
the previous method)


The degrees of freedom for the block chi-square statistic is equal to the difference between the
numbers of parameters estimated in each of the two models.


Improvement Chi-Square 


2(log-likelihood function for current model — log-likelihood function for the model from the 
last step) 


The degrees of freedom for the improvement chi-square statistic is equal to the difference between 
the numbers of parameters estimated in each of the two models. 


Goodness of Fit

$$\sum_{i=1}^{n} \frac{w_i (y_i - \hat{\pi}_i)^2}{\hat{\pi}_i (1 - \hat{\pi}_i)}$$


Cox and Snell’s R-Square (Cox and Snell, 1989; Nagelkerke, 1991)

$$R^2_{CS} = 1 - \left(\frac{l(0)}{l(\hat{\beta})}\right)^{2/W}$$

where $l(\hat{\beta})$ is the likelihood of the current model and l(0) is the likelihood of the
initial model; that is, $\log l(0) = W \log(0.5)$ if the constant is not included in the model;
$\log l(0) = W\left[\hat{\pi}_0 \log\{\hat{\pi}_0/(1 - \hat{\pi}_0)\} + \log(1 - \hat{\pi}_0)\right]$ if the constant is included in the model, where
$\hat{\pi}_0 = \sum_{i=1}^{n} w_i y_i / W$.


Nagelkerke’s R-Square (Nagelkerke, 1991)

$$R^2_N = R^2_{CS} / \max(R^2_{CS})$$

where $\max(R^2_{CS}) = 1 - \{l(0)\}^{2/W}$.
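Because likelihoods underflow easily, the two R-square measures are usually evaluated from log-likelihoods. The sketch below illustrates that rewrite, using $(l(0)/l(\hat{\beta}))^{2/W} = \exp\{2(L(0) - L(\hat{\beta}))/W\}$; the function and argument names are illustrative.

```python
import math


def cox_snell_r2(loglik_model, loglik_initial, W):
    """1 - (l(0) / l(beta_hat))^(2/W), computed on the log scale."""
    return 1.0 - math.exp(2.0 * (loglik_initial - loglik_model) / W)


def nagelkerke_r2(loglik_model, loglik_initial, W):
    """Cox and Snell's value divided by its maximum, 1 - l(0)^(2/W)."""
    max_r2 = 1.0 - math.exp(2.0 * loglik_initial / W)
    return cox_snell_r2(loglik_model, loglik_initial, W) / max_r2


# Example: L(0) = -10, L(beta_hat) = -6, sum of case weights W = 20.
r2_cs = cox_snell_r2(-6.0, -10.0, 20.0)
r2_n = nagelkerke_r2(-6.0, -10.0, 20.0)
```

A model whose log-likelihood reaches 0 (a perfect fit) attains the maximum of Cox and Snell's measure, so Nagelkerke's rescaled value is 1 in that case.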


Hosmer-Lemeshow Goodness-of-Fit Statistic 


The test statistic is obtained by applying a chi-square test on a 2 x g contingency table. The 
contingency table is constructed by cross-classifying the dichotomous dependent variable with 
a grouping variable (with g groups) in which groups are formed by partitioning the predicted 
probabilities using the percentiles of the predicted event probability. In the calculation, 
approximately 10 groups are used (g=10). The corresponding groups are often referred to as the 
“deciles of risk” (Hosmer and Lemeshow, 2000). 

If the values of independent variables for observation i and i’ are the same, observations i and
i’ are said to be in the same block. When one or more blocks occur within the same decile, the
blocks are assigned to this same group. Moreover, observations in the same block are not divided
when they are placed into groups. This strategy may result in fewer than 10 groups (that is,
g < 10) and consequently, fewer degrees of freedom.

Suppose that there are Q blocks, and the qth block has $m_q$ observations, $q = 1, \ldots, Q$.
Moreover, suppose that the kth group ($k = 1, \ldots, g$) is composed of the $q_1$th, ..., $q_k$th blocks of
observations. Then the total number of observations in the kth group is $s_k = \sum_{j=q_1}^{q_k} m_j$. The total
observed frequency of events (that is, Y=1) in the kth group, call it $O_{1k}$, is the total number of
observations in the kth group with Y=1. Let $E_{1k}$ be the total expected frequency of the event in the
kth group; then $E_{1k}$ is given by $E_{1k} = s_k \xi_k$, where $\xi_k$ is the average predicted event probability
for the kth group.


$$\xi_k = \sum_{j=q_1}^{q_k} m_j \hat{\pi}_j / s_k$$


The Hosmer-Lemeshow goodness-of-fit statistic is computed as


$$\chi^2_{HL} = \sum_{k=1}^{g}\frac{(O_{1k} - E_{1k})^2}{E_{1k}(1 - \xi_k)}$$


The p value is given by $\Pr(\chi^2 > \chi^2_{HL})$ where $\chi^2$ is the chi-square statistic distributed with
degrees of freedom (g-2).
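As a rough illustration of the statistic, the sketch below takes the grouping as already done (the percentile-based partition and the block handling described above are not reproduced) and works with unweighted cases. The function name and the `group` argument are hypothetical.

```python
def hosmer_lemeshow(y, pi_hat, group):
    """Return (chi-square, degrees of freedom) for pre-formed groups.

    y      : 0/1 outcomes
    pi_hat : predicted event probabilities
    group  : group label of each case (assumed already assigned)
    """
    totals = {}
    for yi, pi, g in zip(y, pi_hat, group):
        s, o1, psum = totals.get(g, (0, 0, 0.0))
        totals[g] = (s + 1, o1 + yi, psum + pi)
    chi2 = 0.0
    for s_k, o1_k, psum in totals.values():
        xi_k = psum / s_k      # average predicted event probability
        e1_k = s_k * xi_k      # expected frequency of the event
        chi2 += (o1_k - e1_k) ** 2 / (e1_k * (1.0 - xi_k))
    return chi2, len(totals) - 2

chi2, df = hosmer_lemeshow(
    y=[0, 1, 0, 1, 1, 1],
    pi_hat=[0.2, 0.4, 0.5, 0.5, 0.8, 0.9],
    group=[0, 0, 1, 1, 2, 2],
)
```

Summing predicted probabilities over cases gives the same group average as summing over blocks, because all cases in a block share one predicted probability.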


Information for the Variables Not in the Equation 


For each of the variables not in the equation, the score statistic is calculated along with the
associated degrees of freedom, significance, and partial R. Let $X_i$ be a variable not currently in
the model and $S_i$ the score statistic. The partial R is defined by


$$Partial\_R = \begin{cases}\sqrt{\dfrac{S_i - 2\times df}{-2L(initial)}} & \text{if } S_i > 2\times df\\[1ex] 0 & \text{otherwise}\end{cases}$$


where df is the degrees of freedom associated with $S_i$, and L(initial) is the log-likelihood
function for the initial model.
The residual Chi-Square printed for the variables not in the equation is defined as 


t 


Res = (13) Bolg 
where Lg = (Ee, asthe) 


Information for the Variables in the Equation 


For each of the variables in the equation, the MLE of the Beta coefficients is calculated along with
the standard errors, Wald statistics, degrees of freedom, significances, and partial R. If $X_i$ is not a
categorical variable currently in the equation, the partial R is computed as


$$Partial\_R = \begin{cases}sign(\hat{\beta}_i)\sqrt{\dfrac{Wald_i - 2}{-2L(initial)}} & \text{if } Wald_i > 2\\[1ex] 0 & \text{otherwise}\end{cases}$$


If $X_i$ is a categorical variable with m categories, the partial R is then

$$Partial\_R = \begin{cases}\sqrt{\dfrac{Wald_i - 2(m-1)}{-2L(initial)}} & \text{if } Wald_i > 2(m-1)\\[1ex] 0 & \text{otherwise}\end{cases}$$


Casewise Statistics 


The following statistics are computed for each case. 


Individual Deviance 


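The deviance of the ith case, $G_i$, is defined as


$$G_i = \begin{cases}\sqrt{-2\left(y_i\ln(\hat{\pi}_i) + (1-y_i)\ln(1-\hat{\pi}_i)\right)} & \text{if } y_i > \hat{\pi}_i\\[1ex] -\sqrt{-2\left(y_i\ln(\hat{\pi}_i) + (1-y_i)\ln(1-\hat{\pi}_i)\right)} & \text{otherwise}\end{cases}$$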


Leverage 
The leverage of the ith case, $h_i$, is the ith diagonal element of the matrix

$$\hat{V}^{1/2}X(X'\hat{V}X)^{-1}X'\hat{V}^{1/2}$$

where


$$\hat{V} = Diag\{\hat{\pi}_1(1-\hat{\pi}_1), \ldots, \hat{\pi}_n(1-\hat{\pi}_n)\}$$


Studentized Residual 


Logit Residual 


$$\tilde{e}_i = \frac{e_i}{\hat{\pi}_i(1-\hat{\pi}_i)}$$


where $e_i = y_i - \hat{\pi}_i$


Standardized Residual 


$$z_i = \frac{e_i}{\sqrt{\hat{\pi}_i(1-\hat{\pi}_i)}}$$


Cook’s Distance 


$$D_i = \frac{z_i^2 h_i}{1-h_i}$$
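A small sketch of the casewise quantities may help fix the definitions. It assumes the forms $e_i = y_i - \hat{\pi}_i$, logit residual $e_i/(\hat{\pi}_i(1-\hat{\pi}_i))$, standardized residual $z_i = e_i/\sqrt{\hat{\pi}_i(1-\hat{\pi}_i)}$, and Cook's distance $z_i^2 h_i/(1-h_i)$, and it takes the leverage $h_i$ as a given input rather than computing it from the design matrix. The function name is hypothetical.

```python
import math

def casewise_statistics(y_i, pi_i, h_i):
    """Casewise diagnostics for one case, given its leverage h_i."""
    e_i = y_i - pi_i                 # raw residual
    v_i = pi_i * (1.0 - pi_i)        # binomial variance term
    loglik_i = y_i * math.log(pi_i) + (1 - y_i) * math.log(1.0 - pi_i)
    magnitude = math.sqrt(-2.0 * loglik_i)
    z_i = e_i / math.sqrt(v_i)
    return {
        # positive root when y_i exceeds the predicted probability
        "deviance": magnitude if y_i > pi_i else -magnitude,
        "logit_residual": e_i / v_i,
        "standardized_residual": z_i,
        "cooks_distance": z_i ** 2 * h_i / (1.0 - h_i),
    }

event_case = casewise_statistics(1, 0.8, 0.1)
nonevent_case = casewise_statistics(0, 0.8, 0.1)
```

A poorly predicted non-event (here y = 0 with predicted probability 0.8) produces a large negative deviance and standardized residual.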


DFBETA 


Let $\Delta\beta_i$ be the change of the coefficient estimates from the deletion of case i. It is computed as

$$\Delta\beta_i = \frac{(X'\hat{V}X)^{-1}x_i e_i}{1-h_i}$$


Predicted Group


If $\hat{\pi}_i > 0.5$, the predicted group is the group in which y=1.

Note the following:


For the unselected cases with nonmissing values for the independent variables in the analysis,
the leverage ($\tilde{h}_i$) is computed as


4 Vin? 
h; =h; - — 
‘ : 14+Vih 


where 
hs 0x',(X'CVX) “X, 


For the unselected cases, the Cook’s distance and DFBETA are calculated based on $\tilde{h}_i$.


LOGLINEAR Algorithms 


The LOGLINEAR procedure models cell frequencies using the multinomial response model and 
produces maximum likelihood estimates of parameters by the Newton-Raphson method. The 
contingency tables are converted to two-way IxJ tables, with I and J being the dimensions of the 
independent and dependent categorical variables respectively. 


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


$n_{ij}$              Observed frequency of cell (i, j)
I                     Dimension of the row variable, associated with independent variables
J                     Dimension of the column variable, associated with dependent variables
$w_{ij}$              Weight of cell (i, j)
$\beta_k$             Coefficients in the loglinear model; $1 \le k \le p$
$\hat{\beta}_k^{(l)}$ Estimate of $\beta_k$ at the lth iteration
$\hat{\beta}_k$       Final estimate of $\beta_k$
$m_{ij}$              Expected values of $n_{ij}$
$\hat{m}_{ij}^{(l)}$  Estimate of $m_{ij}$ at the lth iteration
$\hat{m}_{ij}$        Estimate of $m_{ij}$ at the final iteration


Model 


In the general LOGLINEAR model, the logarithms of the cell frequencies are formulated as a 
linear function of the parameters. The actual form of the model is determined by the contrast and 
the effects specified. The model has the form 


$$\nu_{ij} = \ln\left(\frac{m_{ij}}{w_{ij}}\right) = \lambda_i + \sum_{k=1}^{p}\beta_k x_{ijk},\qquad 1 \le i \le I,\ 1 \le j \le J$$


where $\lambda_i$ are chosen so that $\sum_j m_{ij} = \sum_j n_{ij}$, and $x_{ijk}$ are the independent variables in the
linear model.


Contrasts


The values of $x_{ijk}$ are determined by the types of contrasts specified in the procedure. The default
contrast is DEVIATION.


Computational Algorithm 


To estimate the coefficients, a series of weighted regressions is used for iterative calculations. The 
iterative process is outlined (also see Haberman, 1978) as follows: 


(1) Obtain initial approximations $\nu_{ij}^{(0)}$ and use them to obtain $\hat{\beta}_k^{(0)}$.

(2) Obtain the next approximations $\nu_{ij}^{(1)}$ and $\hat{m}_{ij}^{(1)}$.

(3) Use the updated $\nu_{ij}^{(1)}$ in (2) to obtain the next approximations $\hat{\beta}_k^{(1)}$.

(4) Repeat steps 2 and 3, replacing $\hat{\beta}_k^{(0)}$ with $\hat{\beta}_k^{(1)}$. Continue repeating this until convergence is
achieved.


The computations begin with selection of initial approximations $\hat{m}_{ij}^{(0)} = n_{ij} + \delta$. The
default for $\delta$ is 0.5. If the model is saturated, $\delta$ is added to $n_{ij}$ permanently. If the model is not
saturated, $\delta$ is added to $n_{ij}$ only at the initial step and is then subtracted at the second step.


The maximum likelihood estimates $\hat{\beta}_k$ of $\beta_k$ are found by the Newton-Raphson method. Let
$\beta^{(l)}$ be the column vector containing the ML estimates at the lth iteration; then


: - -1 () 
30) — (C) gq 


0 4 (CUD) Qln, for] > 0, 


where the (k, 1)-element of C'") is 


forl<i<J,l<k<p 


and the kth element of a") is 


: (0) (0) 

) LijkM; S YijM;,; 
(0) ; (0). (0) J tJ 
Ce LijkYis Mi; ay 
and the kth element of $a^{(l)}$ is


$$a_k^{(l)} = \sum_{i,j} x_{ijk}\left(n_{ij} - \hat{m}_{ij}^{(l)}\right)\quad\text{for } l \ge 1$$


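The estimated cell means are updated by


$$\hat{m}_{ij}^{(l)} = \frac{r_i\, w_{ij}\exp\left(\nu_{ij}^{(l-1)}\right)}{\sum_j w_{ij}\exp\left(\nu_{ij}^{(l-1)}\right)}\quad\text{for } l \ge 1$$

where


$$r_i = \begin{cases}\sum_j (n_{ij} + \delta) & \text{if the model is saturated}\\[1ex] \sum_j n_{ij} & \text{otherwise}\end{cases}$$


and


$$\nu_{ij}^{(l-1)} = \sum_{k=1}^{p} x_{ijk}\hat{\beta}_k^{(l-1)}$$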


The iterative process stops when either the maximum number of iterations (default=20) is reached
or


$$\max_{i,j}\left|\nu_{ij}^{(l+1)} - \nu_{ij}^{(l)}\right| < \epsilon$$


(with default $\epsilon = 0.001$).


Computed Statistics 


The following output statistics are available. 


Correlation Matrix of Parameter Estimates 


Let C be the final $C^{(l)}$ and $H = C^{-1}$. The correlation between $\hat{\beta}_i$ and $\hat{\beta}_j$ is computed as


$$\frac{h_{ij}}{\sqrt{h_{ii}h_{jj}}}$$


Goodness of Fit 


The Pearson chi-square is computed as


$$\chi^2 = \sum_{i,j}\frac{(n_{ij} - \hat{m}_{ij})^2}{\hat{m}_{ij}}$$


and the likelihood-ratio chi-square is


$$L^2 = 2\sum_{i,j} n_{ij}\ln\left(\frac{n_{ij}}{\hat{m}_{ij}}\right)$$

The degrees of freedom are $I \times (J - 1) - p - E$, where E is the number of cells with
nj; wij < 0 and p is the number of coefficients in the model. 
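The two goodness-of-fit statistics are straightforward to compute once observed and fitted cell counts are available. The sketch below is illustrative only: the function name is hypothetical, the fitted counts are supplied by the caller, and cells with a zero observed count are assumed to contribute nothing to the likelihood-ratio sum.

```python
import math

def goodness_of_fit(observed, expected):
    """Return (Pearson chi-square, likelihood-ratio chi-square).

    observed, expected: tables given as lists of rows (n_ij and fitted m_ij).
    """
    pearson = 0.0
    lr_sum = 0.0
    for n_row, m_row in zip(observed, expected):
        for n, m in zip(n_row, m_row):
            pearson += (n - m) ** 2 / m
            if n > 0:  # assumption: a zero count adds nothing to the sum
                lr_sum += n * math.log(n / m)
    return pearson, 2.0 * lr_sum

# Fitted counts below are those of an independence model for the 2 x 2 table.
pearson, lr = goodness_of_fit([[10, 20], [30, 40]], [[12, 18], [28, 42]])
```

For tables that fit reasonably well the two statistics are usually close, as they are in this small example.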


Residuals 


The following residuals are available. 


Unadjusted Residuals 


$$residual_{ij} = n_{ij} - \hat{m}_{ij}$$


Standardized Residuals


$$standardized\ residual_{ij} = \frac{n_{ij} - \hat{m}_{ij}}{\sqrt{\hat{m}_{ij}}}$$


Adjusted Residuals 


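$$adjusted\ residual_{ij} = \frac{n_{ij} - \hat{m}_{ij}}{\sqrt{s_{ij}}}$$


where


$$s_{ij} = \hat{m}_{ij}\left(1 - \frac{\hat{m}_{ij}}{\sum_j \hat{m}_{ij}} - \hat{m}_{ij}\sum_{k}\sum_{l}(x_{ijk} - \theta_{ik})(x_{ijl} - \theta_{il})h_{kl}\right)$$


$$\theta_{ik} = \frac{\sum_j \hat{m}_{ij}x_{ijk}}{\sum_j \hat{m}_{ij}}$$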


Generalized Residuals 


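Consider a linear combination of the cell counts


$$\sum_{i,j} d_{ij} n_{ij}$$


The estimated expected value is computed as


$$\sum_{i,j} d_{ij}\hat{m}_{ij}$$


Two generalized residuals are computed.


Unadjusted Residuals


$$residual = \sum_{i,j} d_{ij} n_{ij} - \sum_{i,j} d_{ij}\hat{m}_{ij}$$


Adjusted Residuals


$$adjusted\ residual = \frac{\sum_{i,j} d_{ij} n_{ij} - \sum_{i,j} d_{ij}\hat{m}_{ij}}{\sqrt{var}}$$


where


$$var = \sum_{i,j} d_{ij}^2\hat{m}_{ij} - \sum_i\frac{\left(\sum_j d_{ij}\hat{m}_{ij}\right)^2}{\sum_j \hat{m}_{ij}} - \sum_{k=1}^{p}\sum_{l=1}^{p} f_k f_l h_{kl}$$


$$f_k = \sum_{i,j}\hat{m}_{ij} d_{ij}(x_{ijk} - \theta_{ik})$$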


Analysis of Dispersion


Following Haberman (1982), define

S(Y) = Total dispersion
S(Y|X) = Conditional dispersion
S(X) = Dispersion due to fit
R = S(X)/S(Y) = Measure of association


For entropy


$$S(Y) = -M\sum_{j=1}^{J}\hat{p}_j\ln(\hat{p}_j)$$


$$S(Y|X) = -\sum_{i=1}^{I}\sum_{j=1}^{J}\hat{m}_{i\cdot}\,\hat{p}_{j|i}\ln(\hat{p}_{j|i})$$


$$S(X) = S(Y) - S(Y|X)$$


For concentration


where

~ Me; 
Pj >= 

- _ mij 
Pi\i > TT 


Mie 


Haberman (1977) shows that, under the hypothesis that Y and X are independent, 
Ve = 25(X) > X7¢7-1) 
in the case of entropy, and 


M(J-1)S(X) 


1, — 2 
Yo = SY) 7 XI-1 


in the case of concentration.


MANOVA Algorithms


The program performs univariate and multivariate analysis of variance and covariance for any 
crossed and/or nested design. 


Analysis of Variance 


The following topics detail computational formulas for MANOVA’s use in the analysis of variance. 


Notation 


The experimental design model (the model with covariates will be discussed later) can be
expressed as


$$\underset{N\times p}{Y} = \underset{N\times m}{W}\;\underset{m\times p}{\Theta} + \underset{N\times p}{E}$$

where

Y is the observed matrix
W is the design matrix
$\Theta$ is the matrix of parameters
E is the matrix of random errors
N is the total number of observations
p is the number of dependent variables
m is the number of parameters


Since the rows of W will be identical for all observations in the same cell, the model is rewritten
in terms of cell means as


$$\underset{g\times p}{\bar{Y}_{\cdot}} = \underset{g\times m}{A}\;\underset{m\times p}{\Theta} + \underset{g\times p}{\bar{E}_{\cdot}}$$


where g is the number of cells and $\bar{Y}_{\cdot}$ and $\bar{E}_{\cdot}$ denote matrices of means.


Reparameterization 
The reparameterization of the model (Bock, 1975; Finn, 1977) is done by factoring A into


$$\underset{g\times m}{A} = \underset{g\times r}{K}\;\underset{r\times m}{L}$$

K forms a column basis for the model and has rank r. The contrast matrix L contains the
coefficients of linear combinations of parameters and has rank r. L can be specified by the user.
Given L, K can be obtained from $AL'(LL')^{-1}$. For designs with more than one factor, L, and
hence K, can be constructed from Kronecker products of contrast matrices of each factor. After
reparameterization, the model can be expressed as

$$\bar{Y}_{\cdot} = A\Theta + \bar{E}_{\cdot} = K(L\Theta) + \bar{E}_{\cdot} = \underset{g\times r}{K}\;\underset{r\times p}{\theta} + \underset{g\times p}{\bar{E}_{\cdot}}$$


Parameter Estimation


An orthogonal decomposition (Golub, 1969) is performed on K. That is, K is represented as

$$K = QR$$


where Q is an orthonormal matrix such that $Q'DQ = I$; D is the diagonal matrix of cell
frequencies; and R is an upper-triangular matrix.


The normal equation of the model is

$$(K'DK)\hat{\theta} = K'D\bar{Y}_{\cdot}$$

or

$$R\hat{\theta} = Q'D\bar{Y}_{\cdot} = U$$


This triangular system can therefore be solved by back-substitution without forming the cross-product matrix $K'DK$.
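The decomposition can be sketched with a weighted Gram-Schmidt pass, which produces a Q that is orthonormal in the D-weighted inner product and an upper-triangular R. This is an illustration under stated assumptions rather than the routine the program uses: the function name is hypothetical, K is assumed to have full column rank, and plain Python lists stand in for matrices.

```python
def weighted_qr_solve(K, d, Y):
    """Solve (K'DK) theta = K'D Y through K = QR with Q'DQ = I.

    K : g x r basis matrix (list of rows), assumed full column rank
    d : g cell frequencies (the diagonal of D)
    Y : g x p matrix of cell means
    Returns (theta, U) where U = Q'DY.
    """
    g, r, p = len(K), len(K[0]), len(Y[0])
    Q = [[0.0] * r for _ in range(g)]
    R = [[0.0] * r for _ in range(r)]
    for j in range(r):
        v = [K[i][j] for i in range(g)]
        for k in range(j):
            # remove the D-weighted projection on each earlier column
            R[k][j] = sum(d[i] * Q[i][k] * v[i] for i in range(g))
            v = [v[i] - R[k][j] * Q[i][k] for i in range(g)]
        R[j][j] = sum(d[i] * v[i] * v[i] for i in range(g)) ** 0.5
        for i in range(g):
            Q[i][j] = v[i] / R[j][j]
    U = [[sum(Q[i][k] * d[i] * Y[i][c] for i in range(g)) for c in range(p)]
         for k in range(r)]
    theta = [[0.0] * p for _ in range(r)]
    for c in range(p):
        for k in range(r - 1, -1, -1):  # back-substitution on R theta = U
            tail = sum(R[k][j] * theta[j][c] for j in range(k + 1, r))
            theta[k][c] = (U[k][c] - tail) / R[k][k]
    return theta, U

# Two cells with a deviation-type basis: a constant and one contrast column.
theta, U = weighted_qr_solve(K=[[1, 1], [1, -1]], d=[2, 3], Y=[[5.0], [7.0]])
```

In this saturated two-cell example the estimates reproduce the cell means exactly: the constant is 6 and the contrast is −1. The rows of U are the building blocks of the hypothesis SSCP matrices.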


Significance Tests 
The sum of squares and cross-products (SSCP) matrix due to the model is

$$\hat{\theta}'R'R\hat{\theta} = U'U$$


and since $var(U) = R\,var(\hat{\theta})R' = I \otimes \Sigma$, the SSCP matrix of each individual effect can be
obtained from the components of


$$U'U = (U_1', \ldots, U_k')\begin{pmatrix}U_1\\ \vdots\\ U_k\end{pmatrix} = U_1'U_1 + \cdots + U_k'U_k$$


Therefore the hypothesis SSCP matrix for testing $H_0: \theta_h = 0$ is


$$\underset{p\times p}{S_H} = \underset{p\times n_h}{U_h'}\;\underset{n_h\times p}{U_h}$$


The default error SSCP matrix is the pooled within-groups SSCP:

$$S_E = Y'Y - \bar{Y}_{\cdot}'D\bar{Y}_{\cdot}$$

If the pooled within-groups SSCP matrix does not exist, the residual SSCP matrix is used:


$$S_E = Y'Y - U'U$$

Four test criteria are available. Each of these statistics is a function of the nonzero eigenvalues
$\lambda_i$ of the matrix $S_H S_E^{-1}$. The number of nonzero eigenvalues, s, is equal to $\min(p, n_h)$.


Pillai’s Criterion (Pillai, 1967) 


$$T = \sum_{i=1}^{s}\frac{\lambda_i}{1 + \lambda_i}$$


Approximate $F = (n_e - p + s)T/(b(s - T))$ with $bs$ and $s(n_e - p + s)$ degrees of freedom, where


$n_e$ = degrees of freedom for $S_E$
$b = \max(p, n_h)$


Hotelling’s Trace 


$$T = \sum_{i=1}^{s}\lambda_i$$


Approximate $F = 2(sn + 1)T/(s^2(2m + s + 1))$ with $s(2m + s + 1)$ and $2(sn + 1)$ degrees of
freedom where


$m = (|n_h - p| - 1)/2$
$n = (n_e - p - 1)/2$


Wilks’ Lambda (Rao, 1973) 


$$T = \prod_{i=1}^{s}\frac{1}{1 + \lambda_i}$$


Approximate $F = (1 - T^{1/l})(Ml + 1 - n_h p/2)/(T^{1/l} n_h p)$ with $n_h p$ and
$(Ml + 1 - n_h p/2)$ degrees of freedom, where


$l^2 = (p^2 n_h^2 - 4)/(p^2 + n_h^2 - 5)$
$M = n_e - (p + 1 - n_h)/2$


Roy’s Largest Root 


$$T = \frac{\lambda_1}{1 + \lambda_1}$$
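Since all four criteria are functions of the same eigenvalues, they can be computed together. The sketch below assumes the eigenvalues of $S_H S_E^{-1}$ have already been obtained; the function name is hypothetical, and only the Hotelling's trace F approximation is included as an example of the approximate tests.

```python
def multivariate_criteria(eigenvalues, p, n_h, n_e):
    """Four multivariate test criteria from the eigenvalues of S_H S_E^{-1}.

    p   : number of dependent variables
    n_h : hypothesis degrees of freedom
    n_e : error degrees of freedom
    """
    lam = [x for x in eigenvalues if x > 0]   # nonzero eigenvalues only
    s = min(p, n_h)
    wilks = 1.0
    for x in lam:
        wilks *= 1.0 / (1.0 + x)
    hotelling = sum(lam)
    m = (abs(n_h - p) - 1) / 2
    n = (n_e - p - 1) / 2
    return {
        "pillai": sum(x / (1.0 + x) for x in lam),
        "hotelling": hotelling,
        "wilks": wilks,
        "roy": max(lam) / (1.0 + max(lam)),
        # approximate F for Hotelling's trace and its degrees of freedom
        "hotelling_F": 2 * (s * n + 1) * hotelling / (s ** 2 * (2 * m + s + 1)),
        "hotelling_df": (s * (2 * m + s + 1), 2 * (s * n + 1)),
    }

criteria = multivariate_criteria([1.0, 0.25], p=2, n_h=2, n_e=10)
```

With eigenvalues 1.0 and 0.25 the criteria are Pillai 0.7, Hotelling 1.25, Wilks 0.4, and Roy 0.5, which shows how differently the four statistics weight the same roots.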


Stepdown F Tests 


The stepdown F statistics are


$$F_i = \frac{(t_i^2 - \tilde{t}_i^2)/n_h}{\tilde{t}_i^2/(n_e - i + 1)}$$


with $n_h$ and $n_e - i + 1$ degrees of freedom, where $\tilde{t}_i$ and $t_i$ are the ith diagonal elements of $T_E$ and
T respectively, and where


$$S_E = T_E'T_E$$
$$S_E + S_H = T'T$$


Design Matrix


K


Estimated Cell Means


$$\hat{\bar{Y}}_{\cdot} = K\hat{\theta}$$


Analysis of Covariance 


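$$\underset{g\times p}{\bar{Y}_{\cdot}} = \underset{g\times r}{K}\;\underset{r\times p}{\theta} + \underset{g\times q}{\bar{X}_{\cdot}}\;\underset{q\times p}{B} + \underset{g\times p}{\bar{E}_{\cdot}}$$


where g, p, and r are as before and q is the number of covariates, and $\bar{X}_{\cdot}$ is the mean of X, the
matrix of covariates.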


Parameter Estimation and Significance Tests 


For purposes of parameter estimation, no initial distinction is made between dependent variables 
and covariates. 


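Let

$$V = (Y, X)$$
$$\bar{V}_{\cdot} = (\bar{Y}_{\cdot}, \bar{X}_{\cdot})$$


The solution of the normal equation of the model


$$\underset{g\times(p+q)}{\bar{V}_{\cdot}} = \underset{g\times r}{K}\;\underset{r\times(p+q)}{\theta} + \underset{g\times(p+q)}{\bar{E}_{\cdot}}$$


is


$$\underset{r\times(p+q)}{\hat{\theta}} = \left(\underset{r\times p}{\hat{\theta}_Y},\ \underset{r\times q}{\hat{\theta}_X}\right)$$

If $S_E$ and $S_T$ are partitioned as


$$S_E = \begin{pmatrix}S_E^{(Y)} & S_E^{(YX)}\\ S_E^{(XY)} & S_E^{(X)}\end{pmatrix},\qquad S_T = \begin{pmatrix}S_T^{(Y)} & S_T^{(YX)}\\ S_T^{(XY)} & S_T^{(X)}\end{pmatrix}$$

then the adjusted error SSCP matrix is

$$S_E^* = S_E^{(Y)} - S_E^{(YX)}\left(S_E^{(X)}\right)^{-1}S_E^{(XY)}$$

and the adjusted total SSCP matrix is

$$S_T^* = S_T^{(Y)} - S_T^{(YX)}\left(S_T^{(X)}\right)^{-1}S_T^{(XY)}$$

The adjusted hypothesis SSCP matrix is then

$$S_H^* = S_T^* - S_E^*$$

The estimate of B is

$$\hat{B} = \left(S_E^{(X)}\right)^{-1}S_E^{(XY)}$$

The adjusted parameter estimates are

$$\hat{\theta}^* = \hat{\theta}_Y - \hat{\theta}_X\hat{B}$$

The adjusted cell means are


$$\hat{\bar{Y}}_{\cdot}^* = K\hat{\theta}^*$$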


Repeated Measures 


The following topics detail computational formulas for MANOVA’s use in the analysis of repeated 
measures data. 


Notation 


The following notation is used within this section unless otherwise stated: 


k        Degrees of freedom for the within-subject factor
SSE*     Orthonormal transformed error matrix
N        Total number of observations
ndfb     Degrees of freedom for all between-subject factors (including the constant)


Statistics


The following statistics are available.


Greenhouse-Geisser Epsilon


$$ggeps = \frac{(tr(SSE^*))^2}{k \times tr((SSE^*)^2)}$$


Huynh-Feldt Epsilon


$$hfeps = \frac{N \times k \times ggeps - 2}{k \times (N - ndfb) - k^2 \times ggeps}$$

if hfeps>1, set hfeps=1


Lower Bound Epsilon


$$lbeps = \frac{1}{k}$$
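The three epsilons can be sketched directly from a transformed error matrix. The code assumes the forms ggeps = tr(SSE*)² / (k · tr(SSE*²)), hfeps = (N·k·ggeps − 2) / (k·(N − ndfb) − k²·ggeps) capped at 1, and lbeps = 1/k; the function name is hypothetical and the matrix is passed as a k × k list of rows.

```python
def sphericity_epsilons(sse, n_obs, ndfb):
    """Return (Greenhouse-Geisser, Huynh-Feldt, lower-bound) epsilons.

    sse   : k x k orthonormal transformed error matrix (list of rows)
    n_obs : total number of observations N
    ndfb  : between-subject degrees of freedom, including the constant
    """
    k = len(sse)
    trace = sum(sse[i][i] for i in range(k))
    # trace of the matrix product SSE* SSE*
    trace_sq = sum(sse[i][j] * sse[j][i] for i in range(k) for j in range(k))
    gg = trace ** 2 / (k * trace_sq)
    hf = (n_obs * k * gg - 2) / (k * (n_obs - ndfb) - k ** 2 * gg)
    return gg, min(hf, 1.0), 1.0 / k

gg, hf, lb = sphericity_epsilons([[4.0, 0.0], [0.0, 1.0]], n_obs=10, ndfb=1)
spherical = sphericity_epsilons([[4.0, 0.0], [0.0, 4.0]], n_obs=10, ndfb=1)
```

When the transformed error matrix is proportional to the identity, the Greenhouse-Geisser value reaches its maximum of 1; unequal diagonal elements pull it toward the lower bound 1/k.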


Effect Size 


The effect size gives a partial eta-squared value for each effect and parameter estimate 


Notation 


The following notation is used within this section unless otherwise stated: 


dfh    Hypothesis degrees of freedom
dfe    Error degrees of freedom
F      F test
W      Wilks’ lambda
s      Number of non-zero eigenvalues of $HE^{-1}$
T      Hotelling’s trace
V      Pillai’s trace


Statistic

$$\text{Partial eta-squared} = \frac{dfh \times F}{dfh \times F + dfe}$$

$$\text{Eta-squared (Wilks')} = 1 - W^{1/s}$$

$$\text{Eta-squared (Hotelling's)} = \frac{T/s}{T/s + 1}$$

$$\text{Total eta-squared} = \frac{\text{sum of squares for effect}}{\text{total (corrected) sum of squares}}$$

$$\text{Hays' omega-squared} = \frac{\text{SS for effect} - df(\text{effect}) \times MSE}{\text{corrected total SS} + MSE}$$

$$\text{Pillai} = V/s$$


Power 


The following statistics pertain to the observed power of F and t tests performed by the procedure.


Univariate Non-Centrality


$$\lambda = \frac{SS(hyp)}{SS(error)} \times dfe$$


Multivariate Non-Centrality

For a single degree of freedom hypothesis

$$\lambda = T \times dfe$$


where T is Hotelling’s trace and dfe is the error degrees of freedom. Approximate power
non-centrality based on Wilks’ lambda is


$$\lambda = \frac{\text{Wilks' eta square}}{1 - \text{Wilks' eta square}} \times dfe(W)$$


where dfe(W) is the error df from Rao’s F-approximation to the distribution of Wilks’ lambda.


Hotelling’s Trace


$$\lambda = \frac{\text{Hotelling's eta square}}{1 - \text{Hotelling's eta square}} \times dfe(H)$$


where dfe(H) is the error df from the F-approximation to the distribution of Hotelling’s trace.


Pillai’s Trace


$$\lambda = \frac{\text{Pillai's eta square}}{1 - \text{Pillai's eta square}} \times dfe(P)$$

where dfe(P) is the error df from the F-approximation to the distribution of Pillai’s trace.


Approximate Power 


Approximate power is computed using an Edgeworth Series Expansion (Mudholkar, Chaubey, 
and Lin, 1976). 


r=v+A 
b=A/r 
K , \i/3 1. 22+1) — 400? 80(1+3b+33b7—77b°) | 176(1+4b—210b7+2380b° 2975 ) 
M1 v7 or 3ap2 Tt 37s t 3074 
1/3 2 80 176 
UAL ots + stag + tt) 
Ky = » \7/3 (2641) , 16p2 _-8(13+39b+40567—102567)  160(1+4b—876? +1168b —1544b*) 
12> UL or + 3892 — F's 
2/3 (_2 104 160 
+c (2 a7 us se) 


, - 8b? 32(1+3b+21b7—62b°) — 32(8+32b—177b? +4550b9 —6625b* ) 
K3 = v 27r2 3073 38r4 
Kass ‘( 7 "( 16(1 Sb 126?— 446?) 256(144b+66"4 247? — ) ) 
Vy OC hk ed oor 
A/3f 16 256 
c (ss 28.) 


Power = 1— &(Y) — oe ¥ 7/2! Ka (y? 1) + Ke (v3 — sv) (vy - 10¥8 4 sy) } 


Confidence Intervals 


The intervals are calculated as follows: 
Lower bound = parameter estimate —k * stderr 
Upper bound = parameter estimate + k * stderr 


where stderr is the standard error of the parameter estimate, and k is the critical constant whose 
value depends upon the type of confidence interval requested. 


Univariate Individual Confidence Intervals 
where 
ne is the error degrees of freedom 
a is the confidence level desired 


F is the percentage point of the F distribution 


Univariate Intervals Joint Confidence Intervals 


For Scheffé intervals:


k = √(nh * F(a; nh, ne))

where

ne is the error degrees of freedom
nh is the hypothesis degrees of freedom
a is the confidence level desired
F is the percentage point of the F distribution


For Bonferroni intervals:

k = t(a/(2 * nh), ne)

where

ne is the error degrees of freedom
nh is the hypothesis degrees of freedom
a is 100 minus the confidence level desired
t is the percentage point of Student’s t distribution


Multivariate Intervals 


The value of the multipliers for the multivariate case is computed as follows: 


Let


p = the number of dependent variables
nh = the hypothesis degrees of freedom
ne = the error degrees of freedom
a = the desired confidence level
s = min(p, nh)
m = (|nh − p| − 1)/2
n = (ne − p − 1)/2


For Roy’s largest root, define

c = G/(1 − G)

where

G = GCR(a; s, m, n), the percentage point of the largest root distribution


For Wilks’ lambda, define


t = (p * nh)^2 − 4
b = p * p + nh * nh − 5
r = √(t/b) if b ≠ 0, else r = 1
u = (p * nh − 2)/4
t = p * nh
b = (nh + ne − (p + nh + 1)/2) * r − 2 * u
f = (t * F(a; t, b))/b
W = (1/(1 + f))^r
c = (1 − W)/W


For Hotelling’s trace, define

t = s(2m + s + 1)
b = 2(sn + 1)
T = (s t F(a; t, b))/b
c = T


For Pillai’s trace, define 


ne—p-+s) 


( 
( 

D = (F(a;t,b)t)/b 
( 


Now for each of the above criteria, the critical value is


k = √(ne * c)

For Bonferroni intervals,

k = t(a/(2p(nh)); ne)


where t is the percentage point of the Student’s t distribution.


Regression Statistics 


Correlation between independent variables and predicted dependent variables


$$cor(X_i, \hat{Y}_j) = \frac{r_{ij}}{R_j}$$


where


$X_i$ = ith predictor (covariate)
$\hat{Y}_j$ = jth predicted dependent variable
$r_{ij}$ = correlation between ith predictor and jth dependent variable
$R_j$ = multiple R for jth dependent variable across all predictors


MEANS Algorithms


Cases are cross-classified on the basis of multiple independent variables, and for each cell of the 
resulting cross-classification, basic statistics are calculated for a dependent variable. 


Notation 


The following notation is used throughout this section unless otherwise stated: 


Table 58-1
Notation

Notation    Description
$X_{ip}$    Value for the pth independent variable for case i
$Y_i$       Value for the dependent variable for case i
$w_i$       Weight for case i
P           Number of independent variables
N           Number of cases

Statistics 


For each value of the first independent variable ($X_1$), for each value of the pair ($X_1, X_2$), for the
triple ($X_1, X_2, X_3$), and similarly for the P-tuple ($X_1, X_2, \ldots, X_P$), the following are computed:


Sum of Case Weights for the Cell


$$W = \sum_{i=1}^{N} w_i I_i$$


where $I_i = 1$ if the ith case is in the cell, $I_i = 0$ otherwise.


The Sum and Corrected Sum of Squares 


$$SMY = \sum_{i=1}^{N} w_i I_i Y_i$$

$$SSY = \sum_{i=1}^{N} w_i I_i Y_i^2$$


$$CSS = SSY - \frac{SMY^2}{W}$$
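A short sketch of the cell sums follows. The function name and the `in_cell` indicator argument are hypothetical; the code applies the computational formula for the corrected sum of squares exactly as written, which is adequate for small, well-scaled data but can lose precision when the mean is large relative to the spread.

```python
def cell_sums(y, w, in_cell):
    """Return (W, SMY, SSY, CSS) for one cell.

    y       : dependent variable values
    w       : case weights
    in_cell : 1 if the case belongs to the cell, 0 otherwise
    """
    W = sum(wi * ind for wi, ind in zip(w, in_cell))
    smy = sum(wi * ind * yi for yi, wi, ind in zip(y, w, in_cell))
    ssy = sum(wi * ind * yi * yi for yi, wi, ind in zip(y, w, in_cell))
    css = ssy - smy ** 2 / W
    return W, smy, ssy, css

W, smy, ssy, css = cell_sums(y=[1, 2, 3, 4], w=[1, 1, 2, 1], in_cell=[1, 1, 1, 0])
```

In this example the third case counts twice because of its weight, and the fourth case is outside the cell, so W = 4, SMY = 9, SSY = 23, and CSS = 2.75.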
The Mean 


> with? 


i=1 


Both summations are over cases with positive wj values. 


Geometric mean


$$\bar{Y}_g = \left(\prod_{i=1}^{N} Y_i^{w_i}\right)^{1/W}$$


The product is taken over cases with positive $w_i$ values.


Variance

$$S^2 = \frac{CSS}{W - 1}$$

Standard Deviation


$$S = \sqrt{\text{variance}}$$


Standard Error of the Mean 


Skewness (computed if W>3 and S > 0), and its standard error


$$g_1 = \frac{W M_3}{(W-1)(W-2)S^3},\qquad se(g_1) = \sqrt{\frac{6W(W-1)}{(W-2)(W+1)(W+3)}}$$

Kurtosis (computed if W> 4 and S > 0), and its standard error


$$g_2 = \frac{W(W+1)M_4 - 3(W-1)M_2^2}{(W-1)(W-2)(W-3)S^4},\qquad se(g_2) = \sqrt{\frac{4(W^2-1)\,se(g_1)^2}{(W-3)(W+5)}}$$


Minimum 


$$\min_i X_i$$


Maximum


$$\max_i X_i$$


Range 


Maximum — Minimum 


Percent of Total N 


For each category j of the independent variable,


$$\%TotN_j = \frac{\sum_{i=1}^{N} w_i I_j}{\sum_{i=1}^{N} w_i} \times 100$$


where $I_j = 1$ if the ith case is in the jth category, $I_j = 0$ otherwise.


Percent of Total Sum


For each category j of the independent variable,

$$\%TotSum_j = \frac{\sum_{i=1}^{N} w_i Y_i I_j}{\sum_{i=1}^{N} w_i Y_i} \times 100$$


where $I_j = 1$ if the ith case is in the jth category, $I_j = 0$ otherwise.


Median


Find the first score interval (x2) containing more than t cases. 


rr if t — cp, 2 100/W 
6 1 W +1)/2-—ee|}x ; 
where 
t=(W +1)/2 


cpi <t< cpp 

$x_1$ and $x_2$ are the values corresponding to $cp_1$ and $cp_2$, respectively
$cc_1$ is the cumulative frequency up to $x_1$

$cp_1$ is the cumulative percent up to $x_1$


Grouped Median 


For more information, see the topic “Grouped Percentiles”. 


ANOVA and Test for Linearity 


If the analysis of variance table or test for linearity are requested, only the first independent 
variable is used. Assume it takes on J distinct values (groups). The previously described statistics 
are calculated and printed for each group separately, as well as for all cases pooled. Symbols 
subscripted from 1 to J will denote group statistics, unsubscripted the total. Thus for group j, 


$SMY_j$ is the sum of the dependent variable


and


$X_j$ is the value of the independent variable. Note that the standard deviation and sum of squares
printed in the last row of the summary table are pooled within group values.


Analysis of Variance


Source                       Sum of Squares                                   df
Between Groups               Total − Within Groups                            J − 1
Regression                   $SS_{reg}$ (defined below)                       1
Deviation from Regression    Between − Regression                             J − 2
Within Groups                $\sum_{j=1}^{J} CSS_j$                           W − J
Total                        $\sum_{j=1}^{J} SSY_j - \left(\sum_{j=1}^{J} SMY_j\right)^2/W$    W − 1

where

$$SS_{reg} = \frac{\left[\sum_{j=1}^{J} X_j SMY_j - \left(\sum_{j=1}^{J} W_j X_j\right)\left(\sum_{j=1}^{J} SMY_j\right)/W\right]^2}{\sum_{j=1}^{J} W_j X_j^2 - \left(\sum_{j=1}^{J} W_j X_j\right)^2/W}$$


The mean squares are calculated by dividing each sum of squares by its degrees of freedom.
The F ratios are the mean squares for each source divided by the within groups mean square.
The significance level for the F is from the F distribution with the degrees of freedom for the
numerator and denominator mean squares. If there is only one group the ANOVA is not done;
if there are fewer than three groups or the independent variable is a string variable, the test for
linearity is not done.


Correlation Coefficient 


$$r = \frac{\sum_{j=1}^{J} X_j SMY_j - \left(\sum_{j=1}^{J} W_j X_j\right)SMY/W}{\sqrt{\left[\sum_{j=1}^{J} W_j X_j^2 - \left(\sum_{j=1}^{J} W_j X_j\right)^2/W\right]\left(SSY - SMY^2/W\right)}}$$


Eta

$$\eta^2 = \frac{\text{Sum of Squares Between Groups}}{\text{Total Sum of Squares}}$$


MIXED Algorithms


This document summarizes the computational algorithms for the linear mixed model (Wolfinger, 
Tobias, and Sall, 1994). 


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


$\theta$            Overall covariance parameter vector
$\theta_G$          A vector of covariance parameters associated with random effects.
$\theta_k$          A vector of covariance parameters associated with the kth random effect.
$\theta_R$          A vector of covariance parameters associated with the residual term.
K                   Number of random effects.
r                   Number of repeated subjects.
$S_k$               Number of subjects in kth random effect.
V                   The n×n covariance matrix of y. This matrix is sometimes denoted by $V(\theta)$.
$\partial V/\partial\theta_s$    First derivative of V with respect to the sth parameter in $\theta$.
$\partial^2 V/\partial\theta_s\partial\theta_t$    Second derivative of V with respect to the sth and tth parameters in $\theta$.
R                   The n×n covariance matrix of $\epsilon$. This matrix is sometimes denoted by $R(\theta_R)$.
$\partial R/\partial\theta_s$    First derivative of R with respect to the sth parameter in $\theta_R$.
$\partial^2 R/\partial\theta_s\partial\theta_t$    Second derivative of R with respect to the sth and tth parameters in $\theta_R$.
G                   The covariance matrix of random effects. This matrix is sometimes denoted by $G(\theta_G)$.
$\partial G/\partial\theta_s$    First derivative of G with respect to the sth parameter in $\theta_G$.
$\partial^2 G/\partial\theta_s\partial\theta_t$    Second derivative of G with respect to the sth and tth parameters in $\theta_G$.
$V_k$               The covariance matrix of the kth random effect for one random subject. This matrix is sometimes denoted by $V_k(\theta_k)$.
$\partial V_k/\partial\theta_s$    First derivative of $V_k$ with respect to the sth parameter in $\theta_k$.
$\partial^2 V_k/\partial\theta_s\partial\theta_t$    Second derivative of $V_k$ with respect to the sth and tth parameters in $\theta_k$.
y                   n×1 vector of dependent variable.
X                   n×p design matrix of fixed effects.
Z                   n×q design matrix of random effects.
r                   n×1 vector of residuals.
$\beta$             p×1 vector of fixed effects parameters.
$\gamma$            q×1 vector of random effects parameters.
$\epsilon$          n×1 vector of residual error.
$W_c$               n×n diagonal matrix of case weights.
$W_r$               n×n diagonal matrix of regression weights.


Model


In this document, we assume a mixed effect model of the form

$$y = X\beta + Z\gamma + \epsilon$$


In this model, we assume that $\epsilon$ is distributed as $N[0, R]$ and $\gamma$ is independently distributed
as $N[0, G]$. Therefore y is distributed as $N[X\beta, V]$, where $V = ZGZ^T + R$. The unknown
parameters include the regression parameters in $\beta$ and covariance parameters in $\theta$. Estimation
of these model parameters relies on the use of a Newton-Raphson or scoring algorithm. When
we use either algorithm for finding MLE or REML solutions, we need to compute $V^{-1}$ and
its derivatives with respect to $\theta$, which are computationally infeasible for large n. Wolfinger
et al. (1994) discussed methods that can avoid direct computation of $V^{-1}$. They tackled the
problem by using the SWEEP algorithm and exploiting the block diagonal structures of G and R.
In the first half of this document, we will detail the algorithm for mixed models without subject
blocking. In the second half of the document we will refine the algorithm to exploit the structure of
G; this is the actual implementation of the algorithm.


If there are regression weights, the covariance matrix R will be replaced by
$R^* = W_r^{-1/2}RW_r^{-1/2}$. For simpler notation, we will assume that the weights are already
included in the matrix R, and they will not be displayed in the remainder of this document. When
case weights are specified, they will be rounded to the nearest integer and each case will be entered
into the analysis multiple times depending on the rounded case weight. Since replicating a case
will lead to duplicate repeated measures (Note: repeated measures are unique within a repeated
subject), non-unity case weights will only be allowed for R with scaled identity structure. In
MIXED, only cases with positive case weight and regression weight will be included in the analysis.


Fixed Effects Parameterization 


The parameterization of fixed effects is the same as in the GLM procedure. 


Random Effects Parameterization 


If we have K random effects and there are $S_k$ random subjects in the kth random effect, the design
matrix Z will be partitioned as


$$Z = [Z_1\ Z_2\ \cdots\ Z_K]$$


where $Z_k$ is the design matrix of the kth random effect. Each $Z_k$ can be partitioned further by
random subjects as below,


$$Z_k = [Z_{k1}\ Z_{k2}\ \cdots\ Z_{kS_k}],\quad k = 1, \ldots, K$$

The number of columns in the design matrix $Z_{kj}$ (jth random subject of kth random effect) is equal
to the number of levels of the kth random effect variable.


Under this partition, G will be a block diagonal matrix which can be expressed as

$$G = \oplus_{k=1}^{K}\left[I_{S_k} \otimes V_k\right]$$


It should also be noted that each random effect has its own parameter vector $\theta_k$, k=1,...,K, and there
are no functional constraints between elements in these parameter vectors. Thus $\theta_G = (\theta_1, \ldots, \theta_K)$.


When there are correlated random effects, $Z_{kj}$ will be a combined design matrix of the correlated
random effects. Therefore in subsequent sections, each random effect can either be one single
random effect or a set of correlated random effects.
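The block structure of G, a direct sum over random effects of Kronecker products of an identity (one copy per random subject) with the per-subject covariance matrix, can be illustrated with NumPy. This is only a sketch for small problems: the function name is hypothetical, and the dense matrix it builds is exactly what a practical implementation would avoid forming.

```python
import numpy as np

def build_G(effects):
    """Assemble the block diagonal covariance matrix of the random effects.

    effects : list of (S_k, V_k) pairs, where S_k is the number of random
              subjects and V_k the per-subject covariance matrix of the
              kth random effect.
    """
    blocks = [np.kron(np.eye(S_k), np.asarray(V_k, dtype=float))
              for S_k, V_k in effects]
    q = sum(block.shape[0] for block in blocks)
    G = np.zeros((q, q))
    start = 0
    for block in blocks:
        size = block.shape[0]
        G[start:start + size, start:start + size] = block  # direct sum
        start += size
    return G

# One variance component with two subjects, then a 2 x 2 unstructured block.
G = build_G([(2, [[1.0]]), (1, [[2.0, 0.5], [0.5, 3.0]])])
```

The marginal covariance of y then follows as `Z @ G @ Z.T + R` for conformable Z and R.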


Repeated Subjects 


When the REPEATED subcommand is used, R will be a block diagonal matrix where the ith
block is $R_i$, $i = 1, \ldots, r$. That is,

$$R = \oplus_{i=1}^{r} R_i$$

The dimension of $R_i$ will be equal to the number of cases in one repeated subject, but all $R_i$ share the
same parameter vector $\theta_R$.


Likelihood Functions 
Recall that the −2 times log likelihood of the MLE is

$$-2l_{MLE}(\beta, \theta) = \log|V| + r^T V^{-1} r + n\log 2\pi$$

and the −2 times log likelihood of the REML is


$$-2l_{REML}(\theta) = \log|V| + r^T V^{-1} r + \log\left|X^T V^{-1} X\right| + (n - p)\log 2\pi$$


where n is the number of observations and p is the rank of the fixed effects design matrix. The key
components of the likelihood functions are


$$l_1(\theta) = \log|V|$$

$$l_2(\theta) = r^T V^{-1} r$$

$$l_3(\theta) = \log\left|X^T V^{-1} X\right|$$

Therefore, in each estimation iteration, we need to compute $l_1(\theta)$, $l_2(\theta)$ and $l_3(\theta)$ as well as their
1st and 2nd derivatives with respect to $\theta$.
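For a small problem the three likelihood components can be evaluated directly, which is useful for checking a more elaborate implementation. The sketch below is a naive dense version and deliberately not the SWEEP-based approach described in this document: it inverts V explicitly, assumes X has full column rank, takes the residual r as y minus the generalized least squares fit, and uses a hypothetical function name.

```python
import numpy as np

def neg2_log_likelihood(y, X, Z, G, R, reml=False):
    """-2 log-likelihood (ML or REML) by direct dense computation."""
    n, p = X.shape
    V = Z @ G @ Z.T + R
    V_inv = np.linalg.inv(V)            # feasible only for small n
    XtVinvX = X.T @ V_inv @ X
    beta = np.linalg.solve(XtVinvX, X.T @ V_inv @ y)  # GLS estimate
    r = y - X @ beta
    l1 = np.linalg.slogdet(V)[1]        # log |V|
    l2 = float(r @ V_inv @ r)           # r' V^{-1} r
    if not reml:
        return l1 + l2 + n * np.log(2 * np.pi)
    l3 = np.linalg.slogdet(XtVinvX)[1]  # log |X' V^{-1} X|
    return l1 + l2 + l3 + (n - p) * np.log(2 * np.pi)

# Intercept-only model with V equal to the identity (no random-effect variation).
y = np.array([1.0, 2.0, 3.0])
X = np.ones((3, 1))
Z = np.zeros((3, 1))
G = np.array([[1.0]])
R = np.eye(3)
ml_value = neg2_log_likelihood(y, X, Z, G, R)
reml_value = neg2_log_likelihood(y, X, Z, G, R, reml=True)
```

In this example the residual sum of squares is 2, so the ML value is 2 + 3 log 2π and the REML value is 2 + log 3 + 2 log 2π.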


Newton & Scoring Algorithms 


Covariance parameters in $\theta$ can be found by maximizing the MLE or REML log-likelihood;
however, there are no closed form solutions in general. Therefore Newton and scoring algorithms
are used to find the solution numerically. The algorithm is outlined below:

1. Compute starting value and initial log-likelihood (REML or ML).


2. Compute gradient vector g and Hessian matrix H of the log-likelihood function using the last
iteration’s estimate $\theta_{i-1}$. (See later section for computation of g and H.)


3. Compute the new step $d = -H^{-1}g$.
4. Let $\rho = 1$.
5. Compute estimates of the ith iteration, $\theta_i = \theta_{i-1} + \rho d$.


6. Check whether $\theta_i$ generates valid covariance matrices and improves the likelihood. If not, reduce $\rho$ by
half and repeat step (5). If this process is repeated for a pre-specified number of times and the
stated conditions are still not satisfied, stop.


7. Check convergence of the parameter. If convergence criteria are met, then stop. Otherwise,
go back to step (2).


Newton’s algorithm performs well if the starting value is close to the solution. In order to improve
the algorithm’s robustness to bad starting values, the scoring algorithm is used in the first few
iterations. This can be done easily by applying different formulae for the Hessian matrix at each
iteration. Apart from improved robustness, the scoring algorithm is faster due to the simpler
form of the Hessian matrix.
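The step-halving loop can be sketched for a single parameter. Everything here is illustrative: the function and argument names are hypothetical, the example maximizes a zero-mean normal log-likelihood in its variance rather than a mixed-model likelihood, and, as a simplification, convergence is declared when the full Newton step falls below the tolerance (checked before the line search so that rounding error in nearly equal likelihood values cannot stall the loop).

```python
import math

def newton_step_halving(loglik, grad, hess, theta0, is_valid,
                        tol=1e-6, max_iter=50, max_halvings=10):
    """One-parameter Newton iteration with step halving.

    Returns (estimate, converged_flag).
    """
    theta = theta0
    ll = loglik(theta)
    for _ in range(max_iter):
        d = -grad(theta) / hess(theta)       # Newton step d = -H^{-1} g
        if abs(d) < tol:                     # simplified convergence check
            return theta, True
        rho = 1.0
        for _ in range(max_halvings + 1):
            candidate = theta + rho * d
            # accept only valid values that do not decrease the likelihood
            if is_valid(candidate) and loglik(candidate) >= ll:
                break
            rho /= 2.0
        else:
            return theta, False              # halving limit reached
        theta, ll = candidate, loglik(candidate)
    return theta, False

# Example: variance of a zero-mean normal sample with sum of squares S.
S, n = 15.0, 4
loglik = lambda v: -0.5 * n * math.log(v) - S / (2.0 * v)
grad = lambda v: -n / (2.0 * v) + S / (2.0 * v ** 2)
hess = lambda v: n / (2.0 * v ** 2) - S / v ** 3
estimate, converged = newton_step_halving(loglik, grad, hess, 1.0, lambda v: v > 0)
```

The validity check plays the role of step 6: a candidate variance that is not positive is rejected and the step is halved before the likelihood is even evaluated.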


Convergence Criteria 


There are three types of convergence criteria: parameter convergence, log-likelihood convergence 
and Hessian convergence. Parameter and log-likelihood convergence are each subdivided into 
absolute and relative. If we let $\epsilon$ be some given tolerance level, $\theta_{s,i}$ be the sth parameter in the ith 
iteration, $l_i$ be the log-likelihood in the ith iteration, $g_i$ be the gradient vector in the ith iteration, and 
$H_i$ be the Hessian matrix in the ith iteration, then the criteria can be written as follows: 


Absolute parameter convergence: $\max_s \left|\theta_{s,i} - \theta_{s,i-1}\right| < \epsilon$ 

Relative parameter convergence: $\max_s \dfrac{\left|\theta_{s,i} - \theta_{s,i-1}\right|}{\left|\theta_{s,i-1}\right|} < \epsilon$ 

Absolute log-likelihood convergence: $\left|l_i - l_{i-1}\right| < \epsilon$ 

Relative log-likelihood convergence: $\dfrac{\left|l_i - l_{i-1}\right|}{\left|l_{i-1}\right|} < \epsilon$ 

Absolute Hessian convergence: $g_i^T H_i^{-1} g_i < \epsilon$ 

Relative Hessian convergence: $\dfrac{g_i^T H_i^{-1} g_i}{\left|l_i\right|} < \epsilon$ 


Denominator terms that equal 0 are replaced by 1. 
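
As an illustration, the six quantities compared against the tolerance can be computed together. The sketch below is an assumption-laden example rather than the procedure's code: the names `safe_div` and `convergence_measures` and the dictionary keys are invented for the example.

```python
import numpy as np

def safe_div(num, den):
    """Divide, replacing zero denominators by 1 as stated in the text."""
    den = np.where(np.asarray(den, dtype=float) == 0, 1.0, den)
    return num / den

def convergence_measures(theta, theta_prev, ll, ll_prev, g, H):
    """Return the left-hand sides of the six criteria (hypothetical helper)."""
    dtheta = np.abs(theta - theta_prev)
    hess = float(g @ np.linalg.solve(H, g))        # g^T H^{-1} g
    return {
        "abs_param": float(np.max(dtheta)),
        "rel_param": float(np.max(safe_div(dtheta, np.abs(theta_prev)))),
        "abs_loglik": abs(ll - ll_prev),
        "rel_loglik": float(safe_div(abs(ll - ll_prev), abs(ll_prev))),
        "abs_hessian": hess,
        "rel_hessian": float(safe_div(hess, abs(ll))),
    }
```

A caller would compare whichever entries correspond to the requested criteria against the tolerance.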
Starting value of Newton’s Algorithm 


If no prior information is available, we can choose the initial values of G and R to be the identity. 
However, it is highly desirable to estimate the scale of the parameter. By ignoring the random 
effects, and assuming the residual errors are i.i.d. with variance $\sigma^2$, we can fit a GLM model and 
estimate $\sigma^2$ by the residual sum of squares $\hat\sigma^2$. Then we choose the starting value of Newton’s 
algorithm to be 


$G = \hat\sigma^2 I$ and $R = \hat\sigma^2 I$ 


Confidence Intervals of Covariance Parameters 


The estimate $\hat\theta$ (ML or REML) is asymptotically normally distributed. Its variance covariance 
matrix can be approximated by $-2H^{-1}$, where H is the Hessian matrix of the log-likelihood 
function evaluated at $\hat\theta$. A simple Wald-type confidence interval for any covariance parameter 
can be obtained by using the asymptotic normality of the parameter estimates; however, it is 
not very appropriate for variance parameters and correlation parameters that have a range of 
$[0, \infty)$ and $[-1, 1]$ respectively. Therefore these parameters are transformed to parameters that 
have range $(-\infty, \infty)$. Using the uniform delta method, see for example van der Vaart (1998), 
these transformed estimates still have asymptotic normal distributions. 

Suppose we are estimating a variance parameter $\sigma^2$ by $\hat\sigma^2$ that is distributed as 
$N\left[\sigma^2, \mathrm{Var}(\hat\sigma^2)\right]$ asymptotically. The transformation we use is $\log(\sigma^2)$, which can correct 
the skewness of $\hat\sigma^2$; moreover, $\log(\sigma^2)$ has the range $(-\infty, \infty)$, which matches that of the normal 
distribution. Using the delta method, one can show that the asymptotic distribution of $\log(\hat\sigma^2)$ is 
$N\left[\log(\sigma^2), \sigma^{-4}\,\mathrm{Var}(\hat\sigma^2)\right]$. Thus, a $(1-\alpha)100\%$ confidence interval of $\log(\sigma^2)$ is given by 


$\log(\hat\sigma^2) \pm z_{1-\alpha/2}\,\hat\sigma^{-2}\sqrt{\mathrm{Var}(\hat\sigma^2)}$ 


where $z_{1-\alpha/2}$ is the upper $(1-\alpha/2)$ percentage point of the standard normal distribution. By inverting this 
confidence interval, a $(1-\alpha)100\%$ confidence interval for $\sigma^2$ is given by 

$\exp\left(\log(\hat\sigma^2) \pm z_{1-\alpha/2}\,\hat\sigma^{-2}\sqrt{\mathrm{Var}(\hat\sigma^2)}\right)$ 


When we need a confidence interval for a correlation parameter $\rho$, a possible transformation 
will be its generalized logit $\mathrm{arctanh}(\rho) = 0.5\log\left((1+\rho)/(1-\rho)\right)$. The resulting confidence 
interval for $\rho$ will be 


$\tanh\left(\mathrm{arctanh}(\hat\rho) \pm z_{1-\alpha/2}\,(1-\hat\rho^2)^{-1}\sqrt{\mathrm{Var}(\hat\rho)}\right)$ 
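
A small sketch of the two transformed intervals may help. It assumes the estimate and its asymptotic variance are already available; the function names `variance_ci` and `correlation_ci` are illustrative only.

```python
import math
from statistics import NormalDist

def variance_ci(var_hat, var_of_var_hat, alpha=0.05):
    """Interval for a variance parameter, built on the log scale."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * math.sqrt(var_of_var_hat) / var_hat     # z * sigma^-2 * sqrt(Var)
    center = math.log(var_hat)
    return math.exp(center - half), math.exp(center + half)

def correlation_ci(rho_hat, var_of_rho_hat, alpha=0.05):
    """Interval for a correlation parameter, built on the arctanh scale."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * math.sqrt(var_of_rho_hat) / (1 - rho_hat ** 2)
    center = math.atanh(rho_hat)
    return math.tanh(center - half), math.tanh(center + half)
```

By construction the variance interval stays positive and the correlation interval stays inside (-1, 1), which is the point of the transformations.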


Fixed and Random Effect Parameters: Estimation and Prediction 


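After we obtain an estimate of $\theta$, the best linear unbiased estimator (BLUE) of $\beta$ and the best 
linear unbiased predictor (BLUP) of $\gamma$ can be found by solving the mixed model equations, 
Henderson (1984). 


$$\begin{bmatrix} X^T R^{-1} X & X^T R^{-1} Z \\ Z^T R^{-1} X & Z^T R^{-1} Z + G^{-1} \end{bmatrix} \begin{bmatrix} \hat\beta \\ \hat\gamma \end{bmatrix} = \begin{bmatrix} X^T R^{-1} y \\ Z^T R^{-1} y \end{bmatrix}$$

The solution of this equation can be expressed as 


$\hat\beta = \left(X^T V^{-1} X\right)^{-} X^T V^{-1} y$ 

$\hat\gamma = G Z^T V^{-1}\left(y - X\hat\beta\right)$ 


The covariance matrix C of $\hat\beta$ and $\hat\gamma$ is given by 


$$C = \mathrm{Cov}(\hat\beta, \hat\gamma) = \begin{bmatrix} X^T R^{-1} X & X^T R^{-1} Z \\ Z^T R^{-1} X & Z^T R^{-1} Z + G^{-1} \end{bmatrix}^{-} = \begin{bmatrix} C_{11} & C_{21}^T \\ C_{21} & C_{22} \end{bmatrix}$$

where 

$C_{11} = \left(X^T V^{-1} X\right)^{-}$ 

$C_{21} = -G Z^T V^{-1} X C_{11}$ 

$C_{22} = \left(Z^T R^{-1} Z + G^{-1}\right)^{-1} - C_{21} X^T V^{-1} Z G$ 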
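
To make the mixed model equations concrete, the following sketch assembles and solves them with NumPy for given G and R. It is an illustration under simplifying assumptions (dense matrices, a pseudo-inverse standing in for the generalized inverse); the function name `solve_mme` is hypothetical.

```python
import numpy as np

def solve_mme(X, Z, y, G, R):
    """Solve Henderson's mixed model equations for fixed and random effects."""
    Rinv = np.linalg.inv(R)
    lhs = np.block([
        [X.T @ Rinv @ X, X.T @ Rinv @ Z],
        [Z.T @ Rinv @ X, Z.T @ Rinv @ Z + np.linalg.inv(G)],
    ])
    rhs = np.concatenate([X.T @ Rinv @ y, Z.T @ Rinv @ y])
    C = np.linalg.pinv(lhs)          # generalized inverse of the coefficient matrix
    solution = C @ rhs
    p = X.shape[1]
    return solution[:p], solution[p:], C
```

With a full-rank X, the first block of the solution agrees with the generalized least squares estimate computed from V = Z G Z^T + R, and the upper-left block of C agrees with the inverse of X^T V^{-1} X.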


Custom Hypotheses 


In general, one can construct estimators or predictors for 

$$Lb = \begin{bmatrix} L_0 & L_1 \end{bmatrix} \begin{bmatrix} \beta \\ \gamma \end{bmatrix}$$

for some hypothesis matrix L. Estimators or predictors of Lb can easily be constructed by 
substituting $\hat\beta$ and $\hat\gamma$ into the equation for Lb, and its variance covariance matrix can be 
approximated by $LCL^T$. If $L_1$ is zero and $L_0\beta$ is estimable, $L\hat b$ is called the best linear unbiased 
estimator of $L_0\beta$. If $L_1$ is nonzero and $L_0\beta$ is estimable, $L\hat b$ is called the best linear unbiased 
predictor of Lb. 


To test the hypothesis $H_0: Lb = a$ for a given vector a, we can use the statistic 


$$F = \frac{(L\hat b - a)^T \left(LCL^T\right)^{-1} (L\hat b - a)}{q}$$


where q is the rank of the matrix L. This statistic has an approximate F distribution. The 
numerator degrees of freedom is q and the denominator degrees of freedom can be obtained by the 
Satterthwaite (1946) approximation. The method outlined below is similar to Giesbrecht and 
Burns (1985), McLean and Sanders (1988), and Fai and Cornelius (1996). 
Satterthwaite’s Approximation 


To find the denominator degrees of freedom of the F statistic, first perform the spectral 
decomposition $LCL^T = \Gamma^T D \Gamma$, where $\Gamma$ is an orthogonal matrix of eigenvectors and D is a 
diagonal matrix of eigenvalues. If $l_m$ is the mth row of $\Gamma L$, $d_m$ is the mth eigenvalue and 


$$\nu_m = \frac{2 d_m^2}{g_m^T \Sigma\, g_m}$$


where $g_m = \left.\dfrac{\partial\, l_m C\, l_m^T}{\partial\theta}\right|_{\theta=\hat\theta}$ and $\Sigma$ is the covariance matrix of the estimated covariance 
parameters. If 


$$E = \sum_{m=1}^{q} \frac{\nu_m}{\nu_m - 2}\, I(\nu_m > 2)$$


then the denominator degrees of freedom is given by 


$$\nu = \frac{2E}{E - q}$$


Note that the degrees of freedom can only be computed when E>q. 
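
Once the eigenvalues, the gradient vectors and the covariance matrix of the covariance parameter estimates are available, the last part of the computation is short. The sketch below assumes those inputs are supplied by the caller; the name `satterthwaite_df` and the argument layout are invented for the example.

```python
import numpy as np

def satterthwaite_df(d, g, sigma):
    """d: (q,) eigenvalues; g: (q, k) rows g_m; sigma: (k, k) covariance matrix.

    Returns the denominator degrees of freedom, or None when E <= q.
    """
    d = np.asarray(d, dtype=float)
    g = np.asarray(g, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    q = len(d)
    nu_m = 2.0 * d ** 2 / np.einsum("mi,ij,mj->m", g, sigma, g)
    E = sum(nu / (nu - 2.0) for nu in nu_m if nu > 2.0)   # indicator I(nu_m > 2)
    if E <= q:
        return None
    return 2.0 * E / (E - q)
```

For a single contrast (q = 1) the result reduces to the single value of nu_m, which is a convenient sanity check.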


Type I and Type III Statistics 


Type I and type III test statistics are special cases of custom hypothesis tests. 


Estimated Marginal Means (EMMEANS) 


Estimated marginal means are special cases of custom hypothesis tests. The construction of the 
L matrix for EMMEANS can be found in the “Estimated Marginal Means” section of GLM’s algorithm 
document. If a Bonferroni or Sidak adjustment is requested for multiple comparisons, it will be 
computed according to the algorithm detailed in Appendix 10: Post Hoc Tests. 


Saved Values 
Predicted values are computed by 
$\hat y = X\hat\beta + Z\hat\gamma$ 
Fixed predicted values are computed by 
$\hat y = X\hat\beta$ 
Residuals are computed by 
$\hat\epsilon = y - \hat y$ 


If standard errors or degrees of freedom are requested for predicted values, an L matrix will be 
constructed for each case and the formula in the custom hypothesis section will be used to obtain 
the requested values. 
Information Criteria 


Information criteria are used for model comparison; the following criteria are given in 
smaller-is-better form. If we let l be the log-likelihood (REML or ML), n be the total number of 
cases (or the total of case weights if used) and d be the number of model parameters, the formulas 
for the various criteria are given below. 


Akaike information criterion (AIC), Akaike (1974): 

$-2l + 2d$ 

Finite sample corrected AIC (AICC), Hurvich and Tsai (1989): 

$-2l + \dfrac{2dn}{n - d - 1}$ 

Bayesian information criterion (BIC), Schwarz (1978): 

$-2l + d \log(n)$ 

Consistent AIC (CAIC), Bozdogan (1987): 

$-2l + d\left(\log(n) + 1\right)$ 


For REML, the value of n is chosen to be the total number of cases minus the number of fixed effect 
parameters, and d is the number of covariance parameters. For ML, the value of n is the total number of 
cases, and d is the number of fixed effect parameters plus the number of covariance parameters. 
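
The criteria are simple enough to compute in one helper. In the sketch below the AICC term uses the usual Hurvich and Tsai form 2dn/(n - d - 1), which is an assumption on our part; the function name `information_criteria` is also illustrative. The caller is expected to pass n and d as defined above for REML or ML.

```python
import math

def information_criteria(loglik, n, d):
    """Smaller-is-better information criteria from a log-likelihood value."""
    minus_2l = -2.0 * loglik
    return {
        "AIC": minus_2l + 2 * d,
        "AICC": minus_2l + 2 * d * n / (n - d - 1),   # assumed Hurvich-Tsai form
        "BIC": minus_2l + d * math.log(n),
        "CAIC": minus_2l + d * (math.log(n) + 1),
    }
```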


Derivatives of Log-Likelihood 


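In each Newton or scoring iteration we need to compute the 1st and 2nd derivatives of the 
components of the log-likelihood $l_k(\theta)$, k=1,2,3. Here we let $g_k = \dfrac{\partial}{\partial\theta} l_k(\theta)$ and $H_k = \dfrac{\partial^2}{\partial\theta\,\partial\theta^T} l_k(\theta)$, 
k=1,2,3; then the 1st derivatives with respect to the sth parameter in $\theta$ are given by 


$[g_1]_s = \mathrm{tr}\left(V^{-1}\dfrac{\partial V}{\partial\theta_s}\right)$ 

$[g_2]_s = -r^T V^{-1}\dfrac{\partial V}{\partial\theta_s}V^{-1} r$ 

$[g_3]_s = -\mathrm{tr}\left(\tilde X^T V^{-1}\dfrac{\partial V}{\partial\theta_s}V^{-1}\tilde X\right)$ 


and the 2nd derivatives with respect to the sth and tth parameters are given by 

$[H_1]_{st} = -\mathrm{tr}\left(V^{-1}\dfrac{\partial V}{\partial\theta_s}V^{-1}\dfrac{\partial V}{\partial\theta_t}\right) + \mathrm{tr}\left(V^{-1}\dfrac{\partial^2 V}{\partial\theta_s\,\partial\theta_t}\right)$ 

$[H_2]_{st} = 2 r^T V^{-1}\dfrac{\partial V}{\partial\theta_s}V^{-1}\dfrac{\partial V}{\partial\theta_t}V^{-1} r - 2 r^T V^{-1}\dfrac{\partial V}{\partial\theta_s}V^{-1}\tilde X\tilde X^T V^{-1}\dfrac{\partial V}{\partial\theta_t}V^{-1} r - r^T V^{-1}\dfrac{\partial^2 V}{\partial\theta_s\,\partial\theta_t}V^{-1} r$ 

$[H_3]_{st} = 2\,\mathrm{tr}\left(\tilde X^T V^{-1}\dfrac{\partial V}{\partial\theta_s}V^{-1}\dfrac{\partial V}{\partial\theta_t}V^{-1}\tilde X\right) - \mathrm{tr}\left(\tilde X^T V^{-1}\dfrac{\partial V}{\partial\theta_s}V^{-1}\tilde X\tilde X^T V^{-1}\dfrac{\partial V}{\partial\theta_t}V^{-1}\tilde X\right) - \mathrm{tr}\left(\tilde X^T V^{-1}\dfrac{\partial^2 V}{\partial\theta_s\,\partial\theta_t}V^{-1}\tilde X\right)$ 


where $\tilde X = XC$ for a matrix C satisfying $CC^T = \left(X^T V^{-1} X\right)^{-} = P$ and $r = y - X\hat\beta$. 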
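
A quick numerical check can build confidence in derivative formulas of this kind. The sketch below verifies only the first one, that the derivative of log|V| with respect to a covariance parameter equals tr(V^{-1} dV/dtheta), for a simple variance components structure V = theta_1 Z Z^T + theta_2 I. That structure and the function names are assumptions chosen for the example.

```python
import numpy as np

def build_v(theta, Z):
    """V = theta[0] * Z Z^T + theta[1] * I (illustrative covariance structure)."""
    return theta[0] * Z @ Z.T + theta[1] * np.eye(Z.shape[0])

def l1(theta, Z):
    """l_1(theta) = log|V|."""
    return np.linalg.slogdet(build_v(theta, Z))[1]

def g1(theta, Z):
    """Analytic gradient of l_1: tr(V^{-1} dV/dtheta_s) for s = 1, 2."""
    Vinv = np.linalg.inv(build_v(theta, Z))
    dV = [Z @ Z.T, np.eye(Z.shape[0])]
    return np.array([np.trace(Vinv @ D) for D in dV])
```

Comparing `g1` with a central finite difference of `l1` at a few parameter values is usually enough to catch sign or indexing mistakes.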


Derivatives: Parameters in G 


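Derivatives with respect to parameters in G can be constructed from the entries of 


$$W_1(X;r;Z) = \begin{bmatrix} W_1(X,X) & W_1(X,Z) & W_1(X,r) \\ W_1(Z,X) & W_1(Z,Z) & W_1(Z,r) \\ W_1(r,X) & W_1(r,Z) & W_1(r,r) \end{bmatrix} = \begin{bmatrix} X^T V^{-1} X & X^T V^{-1} Z & X^T V^{-1} r \\ Z^T V^{-1} X & Z^T V^{-1} Z & Z^T V^{-1} r \\ r^T V^{-1} X & r^T V^{-1} Z & r^T V^{-1} r \end{bmatrix}$$


The matrix $W_1(X;r;Z)$ can be computed from $W_1(X;y;Z)$ given in “Cross Product Matrices” 
by using the following relationship, 


$r = y - Xb_0$ 
where $b_0$ is the current estimate of $\beta$. 


Using the above formula, we can obtain the following expressions, 

$r^T V^{-1} r = y^T V^{-1} y - y^T V^{-1} X b_0 = W_1(y,y) - W_1(y,X)\, b_0$ 

$X^T V^{-1} r = X^T V^{-1} y - X^T V^{-1} X b_0 = W_1(X,y) - W_1(X,X)\, b_0$ 

$Z^T V^{-1} r = Z^T V^{-1} y - Z^T V^{-1} X b_0 = W_1(Z,y) - W_1(Z,X)\, b_0$ 


In terms of the elements in $W_1(X;r;Z)$, we can write down the 1st derivatives of $l_1$, $l_2$ and 
$l_3$ with respect to a parameter $\theta_s$ of the G matrix, 


$[g_1]_{G,s} = \mathrm{tr}\left(W_1(Z,Z)\dfrac{\partial G}{\partial\theta_s}\right)$ 

$[g_2]_{G,s} = -W_1(Z,r)^T\dfrac{\partial G}{\partial\theta_s}W_1(Z,r)$ 

$[g_3]_{G,s} = -\mathrm{tr}\left(W_1(X,Z)\dfrac{\partial G}{\partial\theta_s}W_1(Z,X)\,P\right)$ 

For the second derivatives, we first define the following simplification factors 

$H_{G1}^{st} = -W_1(Z,Z)\dfrac{\partial G}{\partial\theta_s}W_1(Z,Z)\dfrac{\partial G}{\partial\theta_t} + W_1(Z,Z)\dfrac{\partial^2 G}{\partial\theta_s\,\partial\theta_t}$ 

$H_{G2}^{s} = W_1(X,Z)\dfrac{\partial G}{\partial\theta_s}W_1(Z,r)$ 

$H_{G2}^{st} = 2W_1(r,Z)\dfrac{\partial G}{\partial\theta_s}W_1(Z,Z)\dfrac{\partial G}{\partial\theta_t}W_1(Z,r) - W_1(r,Z)\dfrac{\partial^2 G}{\partial\theta_s\,\partial\theta_t}W_1(Z,r)$ 

$H_{G3}^{s} = W_1(X,Z)\dfrac{\partial G}{\partial\theta_s}W_1(Z,X)$ 

$H_{G3}^{st} = 2W_1(X,Z)\dfrac{\partial G}{\partial\theta_s}W_1(Z,Z)\dfrac{\partial G}{\partial\theta_t}W_1(Z,X) - W_1(X,Z)\dfrac{\partial^2 G}{\partial\theta_s\,\partial\theta_t}W_1(Z,X)$ 

then the second derivatives of $l_1$, $l_2$ and $l_3$ w.r.t. $\theta_s$ and $\theta_t$ (in G) are given by 

$[H_1]_{G,st} = \mathrm{tr}\left(H_{G1}^{st}\right)$ 

$[H_2]_{G,st} = H_{G2}^{st} - 2\left(H_{G2}^{s}\right)^T P\, H_{G2}^{t}$ 

$[H_3]_{G,st} = \mathrm{tr}\left(H_{G3}^{st} P\right) - \mathrm{tr}\left(H_{G3}^{s} P\, H_{G3}^{t} P\right)$ 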

Derivatives: Parameters in R 


To compute R derivatives, we need to introduce the matrices 


(1)s IW, 
Wi ~ 08. 
and 
wi?) 07 Wo 

0 ae 


~ OA, OU; 


where @, and @; are the sth and tth parameters of R. Therefore, 
W,(A,B) =A™R™'B 


wy (A.B) =ATR?(R)R7B 
=-AT|¢R/B 


OO, ~ 


-n- (a/R) R" (s-R) Rp 
_-AT| 4 R71 ]B 
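The matrices A and B can be X, Z, $\tilde Z$ or r, where 
$\tilde Z = ZM = Z\left(G^{-1} + Z^T R^{-1} Z\right)^{-1}$ 
and 


$r = \left[I - X\left(X^T V^{-1} X\right)^{-} X^T V^{-1}\right] y = y - Xb_0$ 


Note: The matrix $\left(G^{-1} + Z^T R^{-1} Z\right)^{-1}$ involved in $\tilde Z$ can be obtained by pre/post multiplying 
$\left(I + L^T Z^T R^{-1} Z L\right)^{-1}$ by L and $L^T$. 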


Using these notations, the 1st derivatives of /;, (@) with respect to a parameter in R are as follows, 
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[giln. = ir (Ro R) —tr (wi? (Z, Z)M) 


lesln,, = —Wo”* (rr) + 2Wo (r,Z) Wo” (Zr) 
—Wo (1.2) Wh!" (Z,Z) Wo (Z.r) 


[gslr,, = —'7 (AR) 


To compute 2nd derivatives w.r.t. 6, and @, (of R), we need to consider the following 
simplification factors. 


Hy, = -Ro'(g-R) Ro! (g-R) +R! (aR) 
—~W,,’* (Z,Z)M — Wi” (Z,Z) MW)" (Z,Z)M 

Hz, = Wi (X,r) + Wo (x.Z) wi Ds (7, Z)W (Z.r) 
Wy (x,Z) wi)" (Z,r) — WO" (X,Z) Wo (Z.r) 


x 
we 
I 
| 
S 


(2)st 


-W, (r. zZ) Wy” (Z,Z) Wo (Z.r) 
+2Wy (r. Z) wi" (Zr) 

=) [We & z) Wi” (Z, 

. [wo (Z,Z) Wo (Z.r 


HR; = w,* (X,X) — Wo (x ; z) wi" (Z, X) 
~w')* (X,Z) Wo (Z. x) 
+W, (x.Z) Ww) (Z,Z) Wo (z,x) 


H%,= —W," (X,X) 


Wy (x, z) w)" (ZZ) Wo (z.x) 

+2Wo (X,Z) wi" (Z,X) 

—2 [Wo (X,Z) Wh’ (Z,Z) — Wy" (X,Z)] M 
x [wo (Z,Z) Wo (z.x) — wi! (Z,X)] 


Based on these simplification terms, the entries of the Hessian matrices are given by 
[Hales =r (AR) 
[Ha)p st = HR» — 2(HR») "PHpp 


[Hale oy = tr (Hy,P — Hp;PHR;P) 
G and R Cross-derivatives 


This section gives expressions for the 2nd derivatives of $l_1$, $l_2$ and $l_3$ with respect to a parameter 
$\theta_s$ in G and a parameter $\theta_t$ in R. First, we introduce the following simplification terms, 


Hpi = -W)'”" (2,2) (4-6) 


OO, 


Lowi") (ZZ) (4-6) Wy (z.2) 
—w* (ZZ) Wo (Z. z) (3.6) Wy (z.z) 


Hap: = 2[W§)” (r,Z) — Wo (r,Z) MW)” (Z,2)| 


x [MW (2, Z) — 1] (%-G) [Wo (Z,Z)M - 1] 
x Wo (Z,r) | 


Hitns 2 | Wy)" (X,Z) — Wo (X,Z) MWj"" (Z,Z) 


x (MW, (Z, Z) — I] (4-G) [Wy (Z,Z)M —]] 
x Wo (Z, X) | 


Based on these simplification terms, the second derivatives are given by 
. st 
[Hiler,st = tr (Hép1) 
(H2] = Het, —2(H%.,)’ Ht 
2IGR,st GR27 -\4he2 R2 


[Halor st = tr (Hép3P = Hg;PHR;P) 


Gradient and Hessian of REML 


The restricted log likelihood is given by 


$-2\ell_{REML}(\theta|y) = \log|V| + r^T V^{-1} r + \log\left|X^T V^{-1} X\right| + (n - p)\log 2\pi$ 


where p is equal to the rank of X. Therefore the sth element of the gradient vector is given by 

$[g]_s = [g_1]_s + [g_2]_s + [g_3]_s$ 

and the (s,t)th element of the Hessian matrix is given by 

$[H]_{st} = [H_1]_{st} + [H_2]_{st} + [H_3]_{st}$ 

If the scoring algorithm is used, the Hessian can be simplified to 


$[H]_{st} = -[H_1]_{st} - [H_3]_{st}$ 


Gradient and Hessian of MLE 


The log likelihood is given by 

$-2\ell_{MLE}(\theta|y) = \log|V| + r^T V^{-1} r + n\log 2\pi$ 


Therefore the sth element of the gradient vector is given by 


$[g]_s = [g_1]_s + [g_2]_s$ 


and the (s,t)th element of the Hessian matrix is given by 


$[H]_{st} = [H_1]_{st} + [H_2]_{st}$ 


If the scoring algorithm is used, the Hessian can be simplified to 

$[H]_{st} = -[H_1]_{st}$ 


It should be noted that the Hessian matrices for the scoring algorithm in both ML and REML are 
not ‘exact’. In order to speed up calculation, some second derivative terms are dropped. Therefore, 
they are only used in intermediate steps of the optimization, not for standard error calculations. 


Cross Product Matrices 


During estimation we need to construct several cross product matrices in each iteration, namely: 
Wo (X:y;Z), Wi (X:y;Z), Wa (X;y;Z) , WA(X:y:Z), Wao (X;y), and Wy, (X;y). The 
sweep operator (see for example Goodnight (1979)) is used in constructing these matrices. 
Basically, the sweep operator performs the following transformation 


$$\begin{bmatrix} A & B \\ B^T & C \end{bmatrix} \rightarrow \begin{bmatrix} A^{-} & A^{-}B \\ -B^T A^{-} & C - B^T A^{-} B \end{bmatrix}$$

The steps needed to construct these matrices are outlined below: 

STEP 1: Construct 

$$W_0(X;y;Z) = \begin{bmatrix} X^T R^{-1} X & X^T R^{-1} y & X^T R^{-1} Z \\ y^T R^{-1} X & y^T R^{-1} y & y^T R^{-1} Z \\ Z^T R^{-1} X & Z^T R^{-1} y & Z^T R^{-1} Z \end{bmatrix}$$

STEP 2: Construct $W_0^A(X;y;Z)$, which is an augmented version of $W_0(X;y;Z)$. It is given by the 
following expression. 


$$W_0^A(X;y;Z) = \begin{bmatrix} I + L^T Z^T R^{-1} Z L & L^T W_0(Z,\cdot) \\ W_0(\cdot,Z)\,L & W_0(X;y;Z) \end{bmatrix}$$


where L is the lower-triangular Cholesky root of G, i.e. $G = LL^T$, and $W_0(Z,\cdot)$ is the 
rows of $W_0$ corresponding to Z. 


STEP 3: Sweeping $W_0^A(X;y;Z)$ by pivoting on diagonal elements in the upper-left partition will 
give us the matrix $W_1^A(X;y;Z)$, which is shown below. 

$$W_1^A(X;y;Z) = \begin{bmatrix} W_1^A(1,1) & W_1^A(1,1)\,L^T W_0(Z,\cdot) \\ -W_0(\cdot,Z)\,L\,W_1^A(1,1) & W_1(X;y;Z) \end{bmatrix}$$

where 

$$W_1(X;y;Z) = \begin{bmatrix} X^T V^{-1} X & X^T V^{-1} Z & X^T V^{-1} y \\ Z^T V^{-1} X & Z^T V^{-1} Z & Z^T V^{-1} y \\ y^T V^{-1} X & y^T V^{-1} Z & y^T V^{-1} y \end{bmatrix}$$

and 


$W_1^A(1,1) = \left(I + L^T Z^T R^{-1} Z L\right)^{-1}$ 


During the sweeping, if we accumulate the log of the ith diagonal element just before the ith sweep, 
we will obtain $\log\left|I + L^T Z^T R^{-1} Z L\right| = \log|V| - \log|R|$ as a by-product. Thus, adding $\log|R|$ to this 
quantity will give us $l_1(\theta)$. 


STEP 4: Consider the following submatrix Who (X;y) of Wi(X;y;Z), 


pe -1 i —1 
Win Ky) = [* v-'X XTV 4 


y Tv -1 xX y'V -1 y 
Sweeping Wp (X;y) by pivoting on diagonal elements of X’ V—!X will give us 


Tyy-1 >; 
W1 (X:y) = | Vv X) bo | 


bd ly (0) 
where bo is an estimate of Bo in the current iteration. After this step, we will obtain |. (?) and 


(3.0) = |X! VX] 
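
The sweep operator described in this section can be written compactly. The sketch below follows the sign convention of the transformation shown in “Cross Product Matrices” and also accumulates the log of each pivot just before it is swept, giving the log-determinant by-product. It assumes nonzero pivots and does no tolerance checking; the names `sweep` and `sweep_block` are illustrative.

```python
import numpy as np

def sweep(M, k):
    """Sweep matrix M on pivot k (assumes M[k, k] is nonzero)."""
    M = np.array(M, dtype=float)
    d = M[k, k]
    row = M[k, :] / d                # pivot row scaled by the pivot
    col = M[:, k].copy()
    M -= np.outer(col, row)          # C - B^T A^{-1} B on the remaining entries
    M[k, :] = row                    # A^{-1} B
    M[:, k] = -col / d               # -B^T A^{-1}
    M[k, k] = 1.0 / d                # A^{-1}
    return M

def sweep_block(M, pivots):
    """Sweep on several pivots, accumulating log|A| from the pivots."""
    M = np.array(M, dtype=float)
    logdet = 0.0
    for k in pivots:
        logdet += np.log(M[k, k])    # log of the diagonal just before the sweep
        M = sweep(M, k)
    return M, logdet
```

Sweeping on all pivots of the upper-left block reproduces the block formula of the transformation, and the accumulated sum equals the log-determinant of that block.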
MLP Algorithms 


The multilayer perceptron (MLP) is a feed-forward, supervised learning network with up to two 
hidden layers. The MLP network is a function of one or more predictors (also called inputs or 
independent variables) that minimizes the prediction error of one or more target variables (also 
called outputs). Predictors and targets can be a mix of categorical and scale variables. 


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


$X^{(m)} = \left(x_1^{(m)}, \ldots, x_P^{(m)}\right)$    Input vector, pattern m, m=1,...,M. 
$Y^{(m)} = \left(y_1^{(m)}, \ldots, y_R^{(m)}\right)$    Target vector, pattern m. 
$I$    Number of layers, discounting the input layer. 
$J_i$    Number of units in layer i. $J_0 = P$, $J_I = R$, discounting the bias unit. 
$\Gamma^c$    Set of categorical outputs. 
$\Gamma$    Set of scale outputs. 
$\Gamma_h$    Set of subvectors of $Y^{(m)}$ containing the 1-of-c coded hth categorical variable. 
$a_{i:j}^{m}$    Unit j of layer i, pattern m, $j = 0, \ldots, J_i$; $i = 0, \ldots, I$. 
$w_{i:j,k}$    Weight leading from layer i-1, unit j to layer i, unit k. No weights connect 
    $a_{i-1:j}^{m}$ and the bias $a_{i:0}^{m}$; that is, there is no $w_{i:j,0}$ for any j. 
$c_{i:k}^{m}$    $\sum_{j=0}^{J_{i-1}} w_{i:j,k}\, a_{i-1:j}^{m}$, $i = 1, \ldots, I$. 
$\gamma_i(c)$    Activation function for layer i. 
$w$    Weight vector containing all weights $\left(w_{1:0,1}, w_{1:0,2}, \ldots, w_{I:J_{I-1},J_I}\right)$. 
Architecture 

The general architecture for MLP networks is: 

Input layer: $J_0 = P$ units, $a_{0:1}, \ldots, a_{0:J_0}$, with $a_{0:j} = x_j$. 


ith hidden layer: $J_i$ units, $a_{i:1}, \ldots, a_{i:J_i}$, with $a_{i:k} = \gamma_i(c_{i:k})$ and $c_{i:k} = \sum_{j=0}^{J_{i-1}} w_{i:j,k}\, a_{i-1:j}$, where 
$a_{i-1:0} = 1$. 


Output layer: $J_I = R$ units, $a_{I:1}, \ldots, a_{I:J_I}$, with $a_{I:k} = \gamma_I(c_{I:k})$ and $c_{I:k} = \sum_{j=0}^{J_{I-1}} w_{I:j,k}\, a_{I-1:j}$, where 
$a_{I-1:0} = 1$. 


Note that the pattern index and the bias term of each layer are not counted in the total number of 
units for that layer. 
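
The layer recursion can be illustrated with a short forward pass. The storage layout (one weight matrix per layer with the bias weights in row 0), the function name `forward`, and the default activations are assumptions made for this sketch.

```python
import numpy as np

def forward(x, weights, hidden_act=np.tanh, output_act=lambda c: c):
    """Propagate one input pattern through the network.

    weights[i-1] has shape (J_{i-1} + 1, J_i); row 0 holds the bias weights.
    """
    a = np.asarray(x, dtype=float)
    for i, W in enumerate(weights, start=1):
        a_with_bias = np.concatenate(([1.0], a))    # a_{i-1:0} = 1
        c = a_with_bias @ W                         # c_{i:k} = sum_j w_{i:j,k} a_{i-1:j}
        act = output_act if i == len(weights) else hidden_act
        a = act(c)                                  # a_{i:k} = gamma_i(c_{i:k})
    return a
```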
Activation Functions 


Hyperbolic Tangent 


$\gamma(c) = \tanh(c) = \dfrac{e^{c} - e^{-c}}{e^{c} + e^{-c}}$ 

Sigmoid 

$\gamma(c) = \dfrac{1}{1 + \exp(-c)}$ 

Identity 

$\gamma(c) = c$ 


This is only available for output layer units. 


Softmax 

$\gamma(c_k) = \dfrac{\exp(c_k)}{\sum_{j \in \Gamma_h} \exp(c_j)}$ 


This is only available if all output layer units correspond to categorical variables and cross-entropy 
error is used. 


Error Functions 


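Sum-of-Squares 


$E_T(w) = \sum_{m=1}^{M} E_m(w)$ 

where 

$E_m(w) = \dfrac{1}{2}\sum_{r=1}^{R}\left(y_r^{(m)} - a_{I:r}^{m}\right)^2$ 


Cross-Entropy 


$E_T(w) = \sum_{m=1}^{M} E_m(w)$ 

where 


$E_m(w) = -\sum_{r \in \Gamma^c} y_r^{(m)} \log\left(\dfrac{a_{I:r}^{m}}{y_r^{(m)}}\right)$ 


This is only available if all output layer units correspond to categorical variables and the softmax 
activation function is used. 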
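
For illustration, the softmax activation and the two per-pattern error functions can be written as follows. The factor 1/2 in the sum-of-squares error is an assumption of this sketch (it is the convention under which the backpropagation deltas take the simple form given later), and terms with a zero target are skipped in the cross-entropy because they contribute nothing. All function names are illustrative.

```python
import numpy as np

def softmax(c):
    """Softmax over one group of output units."""
    e = np.exp(c - np.max(c))        # shift by the maximum for numerical stability
    return e / e.sum()

def sum_of_squares(y, a):
    """Per-pattern sum-of-squares error, with an assumed factor of 1/2."""
    return 0.5 * np.sum((y - a) ** 2)

def cross_entropy(y, a):
    """Per-pattern cross-entropy error; terms with y = 0 are omitted."""
    mask = y > 0
    return -np.sum(y[mask] * np.log(a[mask] / y[mask]))
```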


Expert Architecture Selection 


Expert architecture selection determines the “best” number of hidden units in a single hidden 
layer. The hyperbolic tangent activation function is used for the hidden layer, and the identity 
function is used for the output layer (softmax if the output is categorical). 


A random sample is taken from the entire data set and split into training (70%) and testing 
(30%) samples. The size of the random sample is N = min(1000, memsize), where memsize is the 
user-controlled maximum number of cases stored in memory. If the entire dataset has fewer than N 
cases, use all of them. If training and testing data sets are supplied separately, the random samples 
for training and testing should be taken from the respective datasets. 


Given Kmin and Kmax , the algorithm is as follows. 


1. Start with an initial network of k hidden units. The default is k=min(g(R,P),20,h(R,P)), where 


_f |45+VPFR| R<5,P>8 
A Se { [0.5 +0.5(P + R)| otherwise 
where | «| denotes the largest integer less than or equal tox. A(R, P) = [ws is the maximum 


number of hidden units that will not result in more weights than there are cases in the entire 
training set. 


If k < Kmin, set k = Kmin. Else if k > Kmax, set k = Kmax. Train this network once via the alternated 
simulated annealing and training procedure (steps 1 to 5). 


2. If k > Kmin, set DOWN=TRUE. Else if training error ratio > 0.01, DOWN=FALSE. Else stop and 
report the initial network. 


3. If DOWN=TRUE, remove the weakest hidden unit (see below); k=k-1. Else add a hidden unit; 
k=k+1. 


4. Using the previously fit weights as initial weights for the old weights and random weights for the 
new weights, train the old and new weights for the network once through the alternated simulated 
annealing and training procedure (steps 3 to 5) until the stopping conditions are met. 


5. If the error on test data has dropped: 


If DOWN=FALSE: if k < Kmax and the training error has dropped but the error ratio is still above 0.01, 
return to step 3. Else stop and report the network with the minimum test error. 


Else if DOWN=TRUE: if k > Kmin, return to step 3. Else, stop and report the network with the 
minimum test error. 


Else if the error increased: 


If DOWN=TRUE: if |k - k0| > 1, stop and report the network with the minimum test error. 
Else if the training error ratio for k = k0 is bigger than 0.01, set DOWN=FALSE, k = k0 and return to 
step 3. Else stop and report the initial network. 


Else stop and report the network with the minimum test error. 


If more than one network attains the minimum test error, choose the one with a smaller number 
of hidden units. 


If the network resulting from the above procedure has a training error ratio (training error divided by 
the error from the model using the average of an output variable to predict that variable) bigger than 0.1, 
repeat the above procedure with different initial weights until either the error ratio is <=0.1 or the 
procedure has been repeated K times, say K=5. If the procedure has been repeated K times, pick the 
network with the smallest test error. 


The weakest hidden unit 


For each hidden unit j, calculate the error on the test data when j is removed from the network. 
The weakest hidden unit is the one having the smallest total test error upon its removal. 


Training 


Given the training type (online, batch, or mini-batch), the problem of estimating the weights 
consists of the following parts: 


> Initializing the weights. Take a random sample (as described in “Expert Architecture 
Selection”) and apply the alternated simulated annealing and training procedure on the random 
sample to derive the initial weights. Training in step 3 is performed using all default training 
parameters. 


> Computing the derivative of the error function with respect to the weights. This is solved via 
the error backpropagation algorithm. 


> Updating the estimated weights. This is solved by the gradient descent or scaled conjugate 
gradient method. 


Alternated Simulated Annealing and Training 


The following procedure uses simulated annealing and training alternately up to $K_1$ times. 
Simulated annealing is used to break out of the local minimum that training finds by perturbing 
the local minimum $K_2$ times. If the breakout is successful, simulated annealing sets a better initial 
weight for the next training. We hope to find the global minimum by repeating this procedure $K_3$ 
times. This procedure is rather expensive for large data sets, so it is only used on a random sample 
to search for initial weights and in architecture selection. Let $K_1 = K_2 = 4$, $K_3 = 3$. 


1. Randomly generate $K_2$ weight vectors between $[a_0 - a, a_0 + a]$. This is a user-controllable interval 
with default $a_0 = 0$ and a=0.5. Calculate the training error for each weight vector. Pick the weights 
that give the minimum training error as the initial weights. 

2. Set $k_1 = 0$. 

3. Train the network with the specified initial weights. Call the trained weights w. 


4. If the training error ratio <= 0.05, stop the $k_1$ loop and use w as the result of the loop. Else set 
$k_1 = k_1 + 1$. 


5. If $k_1 < K_1$, perturb the old weight to form $K_2$ new weights $w' = w + w_n$ by adding $K_2$ different 
random noise vectors $w_n$ between $[-a(k_1), a(k_1)]$, where $a(k_1) = (0.5)^{k_1} a$. Let $w'_{min}$ be the weights that 
give the minimum training error among all the perturbed weights. If $E_T(w'_{min}) < E_T(w)$, set the 
initial weights to be $w'_{min}$ and return to step 3. Else stop and report w as the final result. 


Else stop the $k_1$ loop and use w as the result of the loop. 

If the resulting weights have a training error ratio bigger than 0.1, repeat this algorithm until either 
the training error ratio is <=0.1 or the procedure has been repeated $K_3$ times, then pick the one with the 
smallest test error among the results of the $k_1$ loops. 


Error Backpropagation 


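Error-backpropagation is used to compute the first partial derivatives of the error function with 
respect to the weights. 


First note that 

$$\gamma'(c) = \begin{cases} 1 - [\gamma(c)]^2 & \text{tanh} \\ \gamma(c)\left(1 - \gamma(c)\right) & \text{sigmoid} \\ 1 & \text{identity} \end{cases}$$


The backpropagation algorithm follows: 

For each i,j,k, set $\dfrac{\partial E_T}{\partial w_{i:j,k}} = 0$. 


For each m in group T; For each $p = 1, \ldots, J_I$, let 


$$\delta_{I:p}^{m} = \frac{\partial E_m}{\partial c_{I:p}^{m}} = \begin{cases} a_{I:p}^{m} - y_p^{(m)} & \text{if cross-entropy error is used} \\ \gamma_I'\left(c_{I:p}^{m}\right)\left(a_{I:p}^{m} - y_p^{(m)}\right) & \text{otherwise} \end{cases}$$


For each i=I,...,1 (start from the output layer); For each $j = 1, \ldots, J_i$; For each $k = 0, \ldots, J_{i-1}$ 


> Let $\dfrac{\partial E_m}{\partial w_{i:k,j}} = \delta_{i:j}^{m}\, a_{i-1:k}^{m}$, where $\delta_{i:j}^{m} = \dfrac{\partial E_m}{\partial c_{i:j}^{m}}$ 

> Set $\dfrac{\partial E_T}{\partial w_{i:k,j}} = \dfrac{\partial E_T}{\partial w_{i:k,j}} + \dfrac{\partial E_m}{\partial w_{i:k,j}}$ 

> If k > 0 and i > 1, set $\delta_{i-1:k}^{m} = \gamma_{i-1}'\left(c_{i-1:k}^{m}\right) \sum_{j=1}^{J_i} \delta_{i:j}^{m}\, w_{i:k,j}$ 


This gives us a vector of $\sum_{i=0}^{I-1} (J_i + 1) J_{i+1}$ elements that form the gradient of $E_T(w)$. 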
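
The recursion can be sketched for one pattern as follows. The example assumes tanh hidden layers, an identity output layer and a sum-of-squares error with a factor of 1/2, and it stores each layer's weights as a matrix with the bias weights in row 0; these choices and the function names `backprop` and `error` are assumptions made for illustration.

```python
import numpy as np

def backprop(x, y, weights):
    """Gradient of E_m = 0.5 * sum((y - a_I)^2) with respect to every weight.

    weights[i-1] has shape (J_{i-1} + 1, J_i); row 0 holds the bias weights.
    """
    acts = [np.asarray(x, dtype=float)]
    for i, W in enumerate(weights, start=1):
        c = np.concatenate(([1.0], acts[-1])) @ W
        acts.append(c if i == len(weights) else np.tanh(c))
    delta = acts[-1] - np.asarray(y, dtype=float)      # identity output: gamma' = 1
    grads = [None] * len(weights)
    for i in range(len(weights), 0, -1):               # start from the output layer
        a_prev = np.concatenate(([1.0], acts[i - 1]))
        grads[i - 1] = np.outer(a_prev, delta)         # delta_{i:j} * a_{i-1:k}
        if i > 1:
            # tanh: gamma'(c) = 1 - gamma(c)^2; the bias row of W is skipped
            delta = (1.0 - acts[i - 1] ** 2) * (weights[i - 1][1:] @ delta)
    return grads

def error(x, y, weights):
    """Per-pattern error for the same network, used to check the gradient."""
    a = np.asarray(x, dtype=float)
    for i, W in enumerate(weights, start=1):
        c = np.concatenate(([1.0], a)) @ W
        a = c if i == len(weights) else np.tanh(c)
    return 0.5 * np.sum((np.asarray(y, dtype=float) - a) ** 2)
```

Summing the returned matrices over all patterns in the training group gives the gradient of the total error; a finite-difference comparison against `error` is a cheap way to validate an implementation.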
Gradient Descent 


Online or Mini-Batch 


Given the learning rate parameters $\eta_0$ and $\eta_{low}$, momentum rate $\alpha$, and learning rate decay factor 
$\beta$, the gradient descent method for online and mini-batch training is as follows. 


1. Let k=0. Initialize the weight vector to $w_0$, learning rate to $\eta_0$. Let $\Delta w_0 = 0$. 

2. Read records in $T_k$ ($T_k$ is randomly chosen) and find $E_k(w_k)$ and its gradient $g_k = \nabla E_k(w_k)$. 


3. If $\eta_k |g_k| < \alpha |\Delta w_k|$, set $\alpha = 0.9\,\dfrac{\eta_k |g_k|}{|\Delta w_k|}$. This step is to make sure that the steepest gradient descent 
direction dominates the weight change in the next step. Without this step, the weight change in the next step 
could be along the opposite direction of the steepest descent and hence, no matter how small $\eta_k$ is, 
the error will not decrease. 

4. Let $v = w_k - \eta_k g_k + \alpha \Delta w_k$. 


5. If $E_k(v) < E_k(w_k)$, then set $w_{k+1} = v$ and $\Delta w_{k+1} = w_{k+1} - w_k$. Else 
$w_{k+1} = w_k$, $\Delta w_{k+1} = \Delta w_k$. 


6. $\eta_{k+1} = \eta_k \exp(-\beta)$. If $\eta_{k+1} < \eta_{low}$, then set $\eta_{k+1} = \eta_{low}$. 


7. If a stopping rule is met, exit and report the network as stated in the stopping criteria. Else let 
k=k+1 and return to step 2. 


Batch 


Given the learning rate parameter $\eta_0$ and momentum rate $\alpha$, the gradient descent method for 
batch training is as follows. 


1. Let k=0. Initialize the weight vector to $w_0$, learning rate to $\eta_0$. Let $\Delta w_0 = 0$. 


2. Read all data and find $E_T(w_k)$ and its gradient $g_k = \nabla E_T(w_k)$. If $|g_k| < 10^{-6}$, stop and report 
the current network. 


3. If $\eta_k |g_k| < \alpha |\Delta w_k|$, set $\alpha = 0.9\,\dfrac{\eta_k |g_k|}{|\Delta w_k|}$. This step is to make sure that the steepest gradient descent 
direction dominates the weight change in the next step. Without this step, the weight change in the next step 
could be along the opposite direction of the steepest descent and hence, no matter how small $\eta_k$ is, 
the error will not decrease. 


4. Let $v = w_k - \eta_k g_k + \alpha \Delta w_k$. 


5. If $E_T(v) < E_T(w_k)$, then set $w_{k+1} = v$, $\Delta w_{k+1} = w_{k+1} - w_k$ and $\eta_{k+1} = \eta_k$. Else set $\eta_k = 0.5\,\eta_k$ and 
return to step 3. 


6. If a stopping rule is met, exit and report the network as stated in the stopping criteria. Else let 
k=k+1 and return to step 2. 
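
The batch variant can be sketched as below. This is a simplified illustration, not the product's implementation: the stopping rules are replaced by an iteration limit, the momentum cap of step 3 is re-evaluated from the nominal rate `alpha0` at every attempt (one possible reading of that step), and a guard on very small learning rates is added to keep the inner loop finite. The function name and default values are assumptions.

```python
import numpy as np

def batch_gradient_descent(error, grad, w0, eta0=0.4, alpha0=0.9, max_iter=500):
    """Gradient descent with momentum and learning-rate halving (batch form)."""
    w = np.asarray(w0, dtype=float)
    eta, dw = eta0, np.zeros_like(w)
    for _ in range(max_iter):
        g = grad(w)
        if np.linalg.norm(g) < 1e-6:                   # step 2: gradient is tiny
            break
        while True:
            alpha = alpha0
            if eta * np.linalg.norm(g) < alpha * np.linalg.norm(dw):    # step 3
                alpha = 0.9 * eta * np.linalg.norm(g) / np.linalg.norm(dw)
            v = w - eta * g + alpha * dw               # step 4
            if error(v) < error(w):                    # step 5: accept the move
                w, dw = v, v - w
                break
            eta *= 0.5                                 # step 5: halve and retry
            if eta < 1e-12:                            # guard added for this sketch
                return w
    return w
```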
Scaled Conjugate Gradient 


This method is only available for batch training. To begin, initialize the weight vector to $w_0$, and 
let N be the total number of weights. 


1. k=0. Choose scalars $0 < \lambda_0 < 10^{-6}$, $0 < \sigma < 10^{-4}$, $\bar\lambda_0 = 0$. Set $r_0 = p_0 = -\nabla E_T(w_0)$, and 
success=true. 


2. If success=true, find the second-order information: $\sigma_k = \dfrac{\sigma}{|p_k|}$, $s_k = \dfrac{\nabla E_T(w_k + \sigma_k p_k) - \nabla E_T(w_k)}{\sigma_k}$, 
$\delta_k = p_k^t s_k$, where the superscript t denotes the transpose. 


3. Set $\delta_k = \delta_k + (\lambda_k - \bar\lambda_k)\,|p_k|^2$. 

4. If $\delta_k \le 0$, make the Hessian positive definite: $\bar\lambda_k = 2\left(\lambda_k - \dfrac{\delta_k}{|p_k|^2}\right)$, $\delta_k = -\delta_k + \lambda_k |p_k|^2$, $\lambda_k = \bar\lambda_k$. 

5. Calculate the step size: $\mu_k = p_k^t r_k$, $\alpha_k = \dfrac{\mu_k}{\delta_k}$. 


6. Calculate the comparison parameter: $\Delta_k = \dfrac{2\delta_k\left[E_T(w_k) - E_T(w_k + \alpha_k p_k)\right]}{\mu_k^2}$. 


7. If $\Delta_k \ge 0$, error can be reduced. Set $w_{k+1} = w_k + \alpha_k p_k$, $r_{k+1} = -\nabla E_T(w_{k+1})$. If 
$|r_{k+1}| < 10^{-6}$, return $w_{k+1}$ as the final weight vector and exit. Set $\bar\lambda_k = 0$, success=true. If k mod 
N=0, restart the algorithm: $p_{k+1} = r_{k+1}$, else set $\beta_k = \dfrac{|r_{k+1}|^2 - r_{k+1}^t r_k}{\mu_k}$, $p_{k+1} = r_{k+1} + \beta_k p_k$. If 
$\Delta_k \ge 0.75$, reduce the scale parameter: $\lambda_k = \frac{1}{4}\lambda_k$. Else (if $\Delta_k < 0$): set $\bar\lambda_k = \lambda_k$, success=false. 


8. If $\Delta_k < 0.25$, increase the scale parameter: $\lambda_k = \lambda_k + \dfrac{\delta_k (1 - \Delta_k)}{|p_k|^2}$. 


9. If success=false, return to step 2. Otherwise if a stopping rule is met, exit and report the network 
as stated in the stopping criteria. Else set k=k+1, $\bar\lambda_{k+1} = \bar\lambda_k$, $\lambda_{k+1} = \lambda_k$ and return to step 2. 


Note: each iteration of batch training requires at least two data passes. 


Stopping Rules 


Training proceeds through at least one complete pass of the data. Then the search should be 
stopped according to the following criteria. These stopping criteria should be checked in the listed 
order. For batch training, all stopping criteria are checked after completion of an 
iteration. For online or mini-batch training, stopping criteria 1, 3, 4, and 5 are 
checked after completion of a data pass; only criterion 2 is checked after an iteration. 
Let step mean a data pass for online and mini-batch methods, and an iteration for the batch method. Let 
$E_1$ denote the current minimum error and $K_1$ denote the step where it occurs for training data, let $E_2$ 
and $K_2$ be those for testing data, and let $K_3 = \min(K_1, K_2)$. 


1. If there is no testing dataset and the training method is online or mini-batch, compute the total 
error for training data at the end of each step. From step $K_1$, if the training error does not decrease 
below $E_1$ over the next n steps, stop. Report the weights at step $K_1$. If there is a testing dataset, 
users have the following options: 


Check testing data only: at the end of each step compute the total error for testing data. From 
step $K_2$, if the testing error does not decrease below $E_2$ over the next n steps, stop. Report the 
weights at step $K_2$. 

Check both training and testing data: at the end of each step simultaneously check the total error 
for training and testing data. From step $K_1$ for training and step $K_2$ for testing, if either training 
or testing error does not decrease below its current minimum over the next n steps, stop. Report 
the weights at step $K_3$. Notice that for the batch method there is no need to check the total error for 
training data because a decrease in total error for training data is guaranteed by the algorithm. 


2. The search has lasted beyond some maximum allotted time. For batch training, report the weights 
at step $K_3$. For online or mini-batch training, even though training stops before the completion of 
the current step, treat this as a complete step. Calculate current errors for training and testing datasets 
and update $E_1$, $K_1$, $E_2$, $K_2$ correspondingly. Report the weights at step $K_3$. 


3. The search has lasted more than some maximum number of data passes. Report the weights 
at step K3. 


4. When current training error is the minimum (£, = FE (w;,), always true for batch), stop if the 


relative change in training error is small: CEE ae 3 < 41 for é = 107'° and ¢,, where 
sZlbT(w,)+er(wR,_-1)t+¢ 


wy—1, wy, are the weight vectors of two consecutive steps. Report weights at step K3. 


5. The current training error ratio is small compared with the initial error: | #r(wx) I< -, for 
an 


: 2 
ae 10~'° and ¢»» Where E’y is the total error from the model using the averdge’o output 


m. 


= M 
variable to predict that variable; E’y is calculated by using a7! = 1 S- yo) in the error function, 
M oT 


where w, is the weight vector of one step. Report weights at step Ky7=! 


Note: In criteria 4 and 5, the total error for whole training data is needed. For batch training, the 
error is always calculated, but for online or mini-batch training, error is not available without 
passing the training data one more time. So for online and mini-batch training, criterion 4 and 5 
will not be checked if user decides to use testing data only in criterion 1. 


Missing Values 


Missing values are not allowed. 


Output Statistics 


The following output statistics are available. Note that, for scale variables, output statistics are 
reported in terms of the rescaled values of the variables. 


Sum-of-Squares or Cross Entropy Error 


As described in “Error Functions”. The cross entropy error is displayed if the output layer 
activation function is softmax, otherwise the sum-of-squares error is shown. 
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Relative Error 


For each scale target r: 


M (m) a(m) 2 
eee Ur — Ur 
M (m) _ 2 
Det Ur = vr) 


For each categorical target r, report p,, the percent of incorrect predictions 


Average Overall Relative Error 


If there is at least one scale target: 


M R 
» Ss (x m) fe)" 
Ur — y J 

=l1 

I 
m) 2 

yy 

mm) 


where 7, is the mean of yj,’ over patterns. 


Satan 


m=lr=1 


If all targets are categorical, report the average percent of incorrect predictions:

$$\frac{1}{C} \sum_{r=1}^{C} p_r$$

where $C$ is the number of categorical variables.
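
The error summaries above can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function names are hypothetical, and it assumes targets are supplied as plain arrays with one row per pattern and one column per output unit.

```python
import numpy as np

def relative_error(y, y_hat):
    """Relative error of one scale target: squared error over squared deviation from the mean."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

def average_overall_relative_error(Y, Y_hat):
    """Pooled relative error: sums run over patterns (rows) and targets (columns)."""
    Y = np.asarray(Y, dtype=float)
    Y_hat = np.asarray(Y_hat, dtype=float)
    return np.sum((Y - Y_hat) ** 2) / np.sum((Y - Y.mean(axis=0)) ** 2)

def average_percent_incorrect(percent_incorrect):
    """All-categorical case: mean of the per-target percent of incorrect predictions."""
    return float(np.mean(percent_incorrect))
```

A model that always predicts the target mean has a relative error of 1, so values below 1 indicate an improvement over that null model.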


Sensitivity Analysis 


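For each predictor $p$ and each input pattern $m$, compute:


$$d_{pm} = \max_{x_{pi},\, x_{pj} \in S_p} \left| \hat{Y}^{(m)}_{x_{pi}} - \hat{Y}^{(m)}_{x_{pj}} \right|$$

where $\hat{Y}^{(m)}_{x_{pi}}$ is the predicted output vector (standardized if standardization of the output
variable is used in training) using $\left( x_1^{(m)}, \ldots, x_{p-1}^{(m)}, x_{pi}, x_{p+1}^{(m)}, \ldots, x_P^{(m)} \right)$ as its input, and $S_p$
is the set of values over which predictor $p$ is varied: a set of values of $x_p$ for scale predictors, and
$\{(1,0,\ldots,0), (0,1,0,\ldots,0), \ldots, (0,0,\ldots,1)\}$ for categorical predictors.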


Then compute:


$$d_p = \frac{1}{M} \sum_{m=1}^{M} d_{pm}$$


and normalize the $d_p$'s to sum to 1, and report these normalized values as the sensitivity values for
the predictors. This is the average maximum amount we can expect the output to change based
on changes in the $p$th predictor. The greater the sensitivity, the more we expect the output to
change when the predictor changes.


MULTIPLE CORRESPONDENCE


Algorithms 


Multiple Correspondence Analysis, also known as homogeneity analysis, quantifies nominal 
(categorical) data by assigning numerical values to the cases (objects) and categories, such that 
in the low-dimensional representation of the data, objects within the same category are close 
together and objects in different categories are far apart. Each object is as close as possible to 
the category points of categories that apply to the object. In this way, the categories divide the 
objects into homogeneous subgroups. Variables are considered homogeneous when they classify 
objects that are in the same categories into the same subgroups. 


Notation 


The following notation is used throughout this chapter unless otherwise stated:


$n$             Number of analysis cases (objects)
$n_w$           Weighted number of analysis cases: $\sum_{i=1}^{n_{tot}} w_i$
$n_{tot}$       Total number of cases (analysis + supplementary)
$w_i$           Weight of object $i$; $w_i = 1$ if cases are unweighted; $w_i = 0$ if object $i$ is
                supplementary.
$W$             Diagonal $n_{tot} \times n_{tot}$ matrix, with $w_i$ on the diagonal.
$m$             Number of analysis variables
$m_w$           Weighted number of analysis variables ($m_w = \sum_{j=1}^{m} v_j$)
$m_{tot}$       Total number of variables (analysis + supplementary)
$H$             The data matrix (category indicators), of order $n_{tot} \times m_{tot}$, after
                discretization, imputation of missings, and listwise deletion, if applicable.
$p$             Number of dimensions


For variable $j$; $j = 1, \ldots, m_{tot}$


$v_j$           Variable weight; $v_j = 1$ if weight for variable $j$ is not specified or if variable
                $j$ is supplementary
$k_j$           Number of categories of variable $j$ (number of distinct values in $h_j$, thus,
                including supplementary objects)
$G_j$           Indicator matrix for variable $j$, of order $n_{tot} \times k_j$


The elements of $G_j$ are defined as ($i = 1, \ldots, n_{tot}$; $r = 1, \ldots, k_j$)

$$g_{(j)ir} = \begin{cases} 1 & \text{when the } i\text{th object is in the } r\text{th category of variable } j \\ 0 & \text{when the } i\text{th object is not in the } r\text{th category of variable } j \end{cases}$$


$D_j$           Diagonal $k_j \times k_j$ matrix, containing the weighted univariate marginals; i.e.,
                the weighted column sums of $G_j$ ($D_j = G_j' W G_j$)
$M_j$           Diagonal $n_{tot} \times n_{tot}$ matrix, with diagonal elements defined as

$$m_{(j)ii} = \begin{cases} 0 & \text{when the } i\text{th observation is missing and the missing strategy for variable } j \text{ is passive} \\ 0 & \text{when the } i\text{th object is in the } r\text{th category of variable } j \text{ and the } r\text{th category is only used by supplementary objects (i.e., when } d_{(j)rr} = 0\text{)} \\ v_j & \text{otherwise} \end{cases}$$

$M_*$           $M_* = \sum_j M_j$


The quantification matrices and parameter vectors are:


$X$             Object scores, of order $n_{tot} \times p$
$X_w$           Weighted object scores ($X_w = WX$)
$X^n$           $X$ normalized according to requested normalization option
$Y_j$           Category quantifications, of order $k_j \times p$.


Note: The matrices $W$, $G_j$, $M_j$, $M_*$, and $D_j$ are exclusively notational devices; they are
stored in reduced form, and the program fully profits from their sparseness by replacing matrix
multiplications with selective accumulation.


Discretization 
Discretization is done on the unweighted data. 
Multiplying 


First, the original variable is standardized. Then the standardized values are multiplied by 10 and 
rounded, and a value is added such that the lowest value is 1. 


Ranking 
The original variable is ranked in ascending order, according to the alphanumerical value. 


Grouping into a specified number of categories with a normal distribution 


First, the original variable is standardized. Then cases are assigned to categories using intervals
as defined in Max (1960).


Grouping into a specified number of categories with a uniform distribution


First the target frequency is computed as $n$ divided by the number of specified categories, rounded.
Then the original categories are assigned to grouped categories such that the frequencies of the
grouped categories are as close to the target frequency as possible.


Grouping equal intervals of specified size 


First the intervals are defined as lowest value + interval size, lowest value + 2*interval size, etc. 
Then cases with values in the kth interval are assigned to category k. 
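
Two of the discretization options above ("Multiplying" and equal intervals) are simple enough to sketch directly. The function names are hypothetical, and two details are assumptions not stated in the text: standardization uses the sample standard deviation (`ddof=1`), and interval $k$ is taken to be half-open, $[\text{lowest} + (k-1)\cdot\text{size},\ \text{lowest} + k\cdot\text{size})$.

```python
import numpy as np

def discretize_multiplying(x):
    """Standardize, multiply by 10, round, then shift so the lowest value is 1."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std(ddof=1)      # assumption: sample standard deviation
    scaled = np.rint(z * 10).astype(int)    # note: np.rint rounds halves to even
    return scaled - scaled.min() + 1

def discretize_equal_intervals(x, size):
    """Assign each value to the interval of width `size` it falls in, counted from the lowest value."""
    x = np.asarray(x, dtype=float)
    # assumption: interval k is [low + (k-1)*size, low + k*size)
    return np.floor((x - x.min()) / size).astype(int) + 1
```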


Imputation of Missing Values 


When there are variables with missing values specified to be treated as active (impute mode or
extra category), then first the $k_j$'s for these variables are computed before listwise deletion. Next
the category indicator with the highest weighted frequency (mode; the smallest if multiple modes
exist), or $k_j + 1$ (extra category) is imputed. Then listwise deletion is applied if applicable, and
the $k_j$'s are adjusted.


Configuration 


MULTIPLE CORRESPONDENCE can read a configuration from a file, to be used as the initial 
configuration or as a fixed configuration in which to fit variables. 


For an initial configuration see step 1 in the Optimization section. 


A fixed configuration $X$ is centered and orthonormalized as described in the optimization section in
step 3 (with $X$ instead of $Z$) and step 4 (except for the factor $n_w^{1/2}$), and the result is postmultiplied
with $\Lambda^{1/2}$ (this leaves the configuration unchanged if it is already centered and orthogonal). The
analysis variables are set to supplementary and variable weights are set to one. Then MULTIPLE
CORRESPONDENCE proceeds as described in the Supplementary Variables section.


Objective Function Optimization 


The MULTIPLE CORRESPONDENCE objective is to find object scores $X$ and a set of $\underline{Y}_j$ (for
$j = 1, \ldots, m$), where the underlining indicates that they may be restricted in various ways, so that
the function


$$\sigma(X; \underline{Y}) = (n_w p)^{-1} \sum_j \mathrm{tr}\left( (X - G_j Y_j)'\, M_j W\, (X - G_j Y_j) \right)$$


is minimal, under the normalization restriction $X' M_* W X = n_w m_w I$ ($I$ is the $p \times p$ identity
matrix). The inclusion of $M_j$ in $\sigma(X; \underline{Y})$ ensures that there is no influence of passive missing
values (missing values in variables that have missing option passive, or missing option not
specified). $M_*$ contains the number of active data values for each object. The object scores are
also centered; that is, they satisfy $u' M_* W X = 0$ with $u$ denoting an $n$-vector with ones.


Optimization is achieved by executing the following iteration scheme: 
1. Initialization 
2. Update category quantifications 
3. Update object scores 
4. Orthonormalization 
5. Convergence test: repeat (2) through (4) or continue 


6. Rotation 


These steps are explained below. 


Initialization 


If an initial configuration is not specified, the object scores $X$ are initialized with random numbers.
Then $X$ is orthonormalized (see step 4) so that $u' M_* W X = 0$ and $X' M_* W X = n_w m_w I$,
yielding $X^+$.


Update Category Quantifications; Loop Across Analysis Variables 

With fixed current values $X^+$, the unconstrained update of $Y_j$ is

$$Y_j = D_j^{-1} G_j' W X^+$$

Update Object Scores

First the auxiliary score matrix $Z$ is computed as

$$Z \leftarrow \sum_j M_j G_j Y_j$$

and centered with respect to $W$ and $M_*$:

$$X^* = \left\{ I - M_* u u' W / (u' M_* W u) \right\} Z$$

These two steps yield the locally best updates in the absence of orthogonality constraints.


Orthonormalization 


To find an M,-orthonormal X* that is closest to X* in the least squares sense, we 
use for the Procrustes rotation (Cliff, 1966) the singular value decomposition 


mi? /?wt/2x* = KAY2L), then yields ni/?m)/?M_*/?W1/2KL’ -orthonormal 
weighted object scores: X,, <nv/?m,M;tWX*LA~!/2L, and X* = W71X+. The 
calculation of L and A is based on tridiagonalization with Householder 
transformations followed by the implicit QL algorithm (Wilkinson, 1965).


Convergence Test


The difference between consecutive values of the quantity


$$\mathrm{TFIT} = (n_w p)^{-1} \sum_j v_j\, \mathrm{tr}\left( Y_j' D_j Y_j \right)$$


is compared with the user-specified convergence criterion $\varepsilon$, a small positive number. It can
be shown that $\mathrm{TFIT} = m_w - \sigma(X; Y)$. Steps (2) through (4) are repeated as long as the loss
difference exceeds $\varepsilon$.


After convergence, TFIT is also equal to $\mathrm{tr}(\Lambda^{1/2})$, with $\Lambda$ as computed in step (4) during the last
iteration. (See also Model Summary, and Correlations Transformed Variables for interpretation
of $\Lambda^{1/2}$.)


Rotation 
To achieve principal axes orientation, X* is rotated with the matrix L. Then step (2) is executed, 
yielding the rotated quantifications. 

Supplementary Objects 


To compute the object scores for supplementary objects, steps (2) and (3) are repeated after
convergence, with the zeros in $W$ temporarily set to ones in computing $Z$ and $X^*$. If a supplementary
object has missing values, passive treatment is applied.


Supplementary Variables 


The quantifications for supplementary variables are computed after convergence by executing step 
(2) once. 


Diagnostics 


The following diagnostics are available. 


Maximum Rank (may be issued as a warning when exceeded) 


The maximum rank $p_{max}$ indicates the maximum number of dimensions that can be computed
for any dataset. In general


$$p_{max} = \min\left\{ n - 1,\ \left( \sum_{j=1}^{m} k_j \right) - m \right\}$$


if there are no variables with missing values to be treated as passive. If variables do have missing
values to be treated as passive, the maximum rank is


$$p_{max} = \min\left\{ n - 1,\ \left( \sum_{j=1}^{m} k_j \right) - \max(m_1, 1) \right\}$$

with $m_1$ the number of variables without missing values to be treated as passive.

Here $k_j$ excludes supplementary objects (that is, a category only used by supplementary objects
is not counted in computing the maximum rank). Although the number of nontrivial dimensions
may be less than $p_{max}$ when $m = 2$, MULTIPLE CORRESPONDENCE does allow dimensionalities
all the way up to $p_{max}$. When, due to empty categories in the actual data, the rank deteriorates
below the specified dimensionality, the program stops.
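
As a quick illustration of the two maximum-rank formulas, the sketch below computes $p_{max}$ from the number of analysis cases and the category counts. The function name and the convention of passing `m1=None` when no variable has passive missing values are assumptions made for this example.

```python
def max_rank(n, k, m1=None):
    """Maximum number of dimensions.

    n  : number of analysis cases
    k  : list of category counts k_j (categories used only by supplementary objects excluded)
    m1 : number of variables without passive missing values, or None when no variable
         has missing values treated as passive (a convention chosen for this sketch)
    """
    m = len(k)
    if m1 is None:
        return min(n - 1, sum(k) - m)
    return min(n - 1, sum(k) - max(m1, 1))
```

For example, three variables with 3, 4, and 2 categories and 100 cases give `max_rank(100, [3, 4, 2])`, which evaluates to 6.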


Descriptives 


The descriptives table gives the weighted univariate marginals and the weighted number of
missing values (system missing, user defined missing, and values less than or equal to 0) for
each variable.


Fit and Loss Measures 


When the HISTORY option is in effect, the following fit and loss measures are reported:

Fit (VAF). This is the quantity TFIT as defined in step (5).

Loss. This is $\sigma(X; Y)$.


Model Summary 


Model summary information consists of Cronbach’s alpha, the variance accounted for, and the 
inertia. 


Cronbach’s Alpha 
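Cronbach’s Alpha per dimension ($s = 1, \ldots, p$):

$$\alpha_s = m_w \left( \lambda_s^{1/2} - 1 \right) / \left( \lambda_s^{1/2} (m_w - 1) \right)$$

Total Cronbach’s Alpha is

$$\alpha = m_w \left( \sum_s \lambda_s^{1/2} - 1 \right) / \left( \sum_s \lambda_s^{1/2} (m_w - 1) \right)$$


with $\lambda_s$ the $s$th diagonal element of $\Lambda$ as computed in step (4) during the last iteration.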
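
The alpha formulas depend only on the eigenvalues $\lambda_s^{1/2}$ and the weighted number of variables $m_w$, so they are easy to check by hand. A minimal sketch, with a hypothetical function name and the eigenvalues passed in as a plain list:

```python
def cronbach_alpha(eigenvalues, m_w):
    """Return (per-dimension alphas, total alpha) from the eigenvalues lambda_s^(1/2)."""
    per_dimension = [m_w * (ev - 1) / (ev * (m_w - 1)) for ev in eigenvalues]
    total = sum(eigenvalues)
    total_alpha = m_w * (total - 1) / (total * (m_w - 1))
    return per_dimension, total_alpha
```

Note that an eigenvalue below 1 yields a negative alpha for that dimension, which follows directly from the numerator $\lambda_s^{1/2} - 1$.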


Variance Accounted For 


Variance Accounted For per dimension ($s = 1, \ldots, p$):


$$\mathrm{VAF}_s = n_w^{-1} \sum_j v_j\, \mathrm{tr}\left( y_{(j)s}'\, D_j\, y_{(j)s} \right)$$

(% of variance is $\mathrm{VAF}_s \times 100 / m_w$).


Eigenvalue per dimension:

$$\lambda_s^{1/2} = \mathrm{VAF}_s,$$


with $\lambda_s$ the $s$th diagonal element of $\Lambda$ as computed in step (4) during the last iteration. (See also
Optimization step (5), and Correlations Transformed Variables for interpretation of $\Lambda^{1/2}$.)


The Total Variance Accounted For is the mean over dimensions. So, the total eigenvalue is


$$\mathrm{tr}(\Lambda^{1/2}) = p^{-1} \sum_s \mathrm{VAF}_s.$$


If there are no passive missing values, the eigenvalues $\Lambda^{1/2}$ are those of the correlation matrix
(see the Correlations and Eigenvalues section) weighted with variable weights:


$$r^w_{ii} = v_i, \quad \text{and} \quad r^w_{ij} = r^w_{ji} = r_{ij}\, v_i^{1/2} v_j^{1/2}$$


If there are passive missing values, then the eigenvalues are those of the matrix 7, »Q cM. Qe, 
with Qc = ng? (1 —M uu W / (uM. Wu] Q, (for Q see the Correlations and Eigenvalues 
section) which is not necessarily a correlation matrix, although it is positive semi-definite. This 
matrix is weighted with variable weights in the same way as R. 


Inertia 
The inertia per dimension is the eigenvalue per dimension divided by $m_w$. The total inertia is
the total eigenvalue divided by $m_w$.


Correlations and Eigenvalues 


Before transformation 


$R = n_w^{-1} H_c' W H_c$, with $H_c$ weighted centered and normalized $H$. For the eigenvalue
decomposition of $R$ (to compute the eigenvalues), first row $j$ and column $j$ are removed from $R$ if $j$
is a supplementary variable, and then $r_{ij}$ is multiplied by $(v_i v_j)^{1/2}$.


If passive missing treatment is applicable for a variable, missing values are imputed with the 
variable mode, regardless of the passive imputation specification. 


After transformation 


After transformation, $p$ correlation matrices are computed ($s = 1, \ldots, p$):


$$R_{(s)} = n_w^{-1} Q_{(s)}' W Q_{(s)},$$


with $q_{(s)j} = n_w^{1/2}\, G_j\, y_{(j)s} \left( y_{(j)s}'\, D_j\, y_{(j)s} \right)^{-1/2}$.


Usually, for the higher eigenvalues, the first eigenvalue of $R_{(s)}$ is equal to $\lambda_s^{1/2}$ (see Model
Summary section). The lower values of $\Lambda^{1/2}$ are in most cases the second or subsequent
eigenvalues of $R_{(s)}$.

If there are missing values, specified to be treated as passive, the mode of the quantified variable
or the quantification of an extra category (as specified in syntax; if not specified, default (mode)
is used) is imputed before computing correlations. Then the eigenvalues of the correlation
matrix do not equal $\Lambda^{1/2}$ (see Model Summary section). The quantification of an extra category
is computed as


—1 
Tighe (> «] S Wi Liss 
i‘ ier ie. 


with J an index set recording which objects have missing values. 


For the eigenvalue decomposition of $R$ (to compute the eigenvalues), first row $j$ and column $j$ are
removed from $R$ if $j$ is a supplementary variable, and then $r_{ij}$ is multiplied by $(v_i v_j)^{1/2}$.


Discrimination measures 


The discrimination measures are the dimensionwise variances of the quantified variables, which
are equal to the dimensionwise squared correlations of the quantified variables with the object
scores. For variable $j$ and dimension $s$ the discrimination measure is


$$\mathrm{Discr}_{js} = n_w^{-1}\, y_{(j)s}'\, D_j\, y_{(j)s}$$


which is equal to the squared correlation between $G_j y_{(j)s}$ and $x_s$.


Object Scores 


If $\Lambda^{1/2}$ gives the eigenvalues, then $\Lambda^{1/4}$ gives the singular values, which can be used to spread the
inertia over the object scores X and the category quantifications Y. During the optimization phase, 
variable principal normalization is used, then X" — X and Y" = Y, else X" = X(m-!A)" “and 
Y= Y¥ (mata) 1 A(0~1) with a=(1+q)/2, b=(1-q)/2, and q any real value in the closed interval 
[-1,1], except for independent normalization: then there is no $q$ value and $a = b = 1$. $q = 1$ is equal to
variable principal normalization, $q = -1$ is equal to object principal normalization, and $q = 0$ is equal to
symmetrical normalization.


Mass 
The mass of object i is 


Mass, = —2=it — 
* tM. w) 


Inertia 


The inertia of object i is 


; vj 
Inertia, = —~— /__ _ Mass, 
4 Niwai d 7 


JhijgAO (hig 
where $d_{(j)h_{ij}h_{ij}}$ is the frequency of the category of object $i$ on variable $j$, and $h_{ij} \neq 0$ indicates
that a variable is excluded if object $i$ has a missing value on the variable and the missing option for
the variable is passive.


Contribution of point to inertia of dimension 


The contribution of object i to the inertia of dimension s is 


Maji 


Contribution;, = “~~ 


Contribution of dimension to inertia of point 


The contribution of dimension s to the inertia of object i is 


m 


Contribution _ —+~Inertia. 
2S Se Se - S 


Inertia’ 


Quantifications 


The quantifications are the centroid coordinates. If a category is only used by supplementary 
objects (i.e. treated as a passive missing), the centroid coordinates for this category are computed 
as 


1/2, -1 1/4(b-1 
Y(jjr =Nw 25, ) xj;A ) 
ie. 


where y,j), is the rt! row of Yj, ,, is the number of objects that have category r, and I is an 
index set recording which objects are in category r. 


Mass 


The mass of category r of variable j is 


Inertia 


The inertia of category r of variable j is 


n 


) WIM xii G(j)ir 


Inertia, ;),. = =! 


7 Mass, jy, 


(j)rr 


1 


if there are no missing values with missing option passive, this is equal to —— — Mass, ;),, and 
w 


j)r? 
Fie 1) 


then the total inertia for variable j is wiki) 
Contribution of point to inertia of dimension


The contribution of category r of variable j to the inertia of dimension s is 


y~ 


d( 4) pr 


(j)rr 


Inertia, 


Contribution, ;),. = 


the total contribution of variable j to the inertia of dimension s is “ Discr, : 


Contribution of dimension to inertia of point 


The contribution of dimension s to the inertia of category r of variable j is 


u 


dis)pr 

Contribution, ,,,. = ~2 tees 
s(g)r 

Inertia,;),. 


Residuals 


Plots per dimension are produced of $G_j y_{(j)s}$ against the approximation $x_s$.


Multiple Imputation Algorithms


Multiple imputation imputes missing values multiple times. This algorithm only considers the 
imputation phase. See “Multiple Imputation: Pooling Algorithms” for the algorithm for 
combining analysis results of multiply imputed data sets. 

Univariate and multivariate methods are given here. Univariate methods are used in situations 
where only the variable to be imputed has missing values, and all variables used as predictors 
in the imputation have no missing values. Multivariate methods are used in situations where 
variables are used both as dependents and predictors during imputation. 


Notation 


The following notation is used throughout this chapter unless otherwise stated:


$X$                                              Set of variables that have no missing values.
$y_{ij}$                                         The data value for case $i$, variable $j$. It may be missing.
$y_j^{obs}$                                      The collection of observed values of variable $j$.
$y_j^{mis}$                                      The collection of missing values of variable $j$.
$Y^{obs} = (y_1^{obs}, \ldots, y_K^{obs})$       The collection of all observed data.
$Y^{mis} = (y_1^{mis}, \ldots, y_K^{mis})$       The collection of all missing data.
$f_i$                                            Frequency (replication) weight for case $i$. Must be integer.
$F = \mathrm{diag}(f_1, \ldots, f_n)$            Frequency weight matrix, diagonal with case frequency weight on the
                                                 diagonal.
$w_i$                                            Regression or analysis weight for case $i$.
$W = \mathrm{diag}(w_1, \ldots, w_n)$            Regression weight matrix.
$n$                                              The total number of cases. Each case may represent more than one
                                                 observation due to frequency (replication) weights.
$N = \sum_{i=1}^{n} f_i$                         The total number of observations.
$N_w = \sum_{i=1}^{n} w_i f_i$                   The total weight.


Univariate Methods 
$y$: the variable to be imputed; it has missing values.

$x$: predictor variables, with no missing values.


Linear Regression


The variable to be imputed, y, is a continuous variable and is to be used as the dependent variable 
in the regression model. Both frequency and regression weights are accepted. 


Model: $y_i = x_i' \beta + e_i$ with $e_i \sim N(0, \sigma^2 / w_i)$ is used.

Prior: $\Pr(\beta, \log \sigma^2) \propto 1$, or equivalently $\Pr(\beta, \sigma^2) \propto 1/\sigma^2$.


Using the complete cases, fit the regression model, assuming that all redundant parameters are
removed if there are any. Denote the fitted parameters as $(\hat{\beta}, \hat{\sigma}^2)$ such that


$$\hat{\beta} = \left( X_c' F_c W_c X_c \right)^{-1} X_c' F_c W_c Y_c$$

$$\hat{\sigma}^2 = \left( Y_c - X_c \hat{\beta} \right)' F_c W_c \left( Y_c - X_c \hat{\beta} \right) / (N_{obs} - p)$$


where $N_{obs} = \sum_{i \in obs(Y)} f_i$ is the number of complete cases, $p$ is the number of parameters, and
$Y_c$, $X_c$, $F_c$, $W_c$ are the dependent vector, design matrix, frequency weight matrix, and regression weight
matrix for complete cases.


The posterior distributions are:


$$\beta \mid \sigma^2, Y_c, X_c \sim N\left( \hat{\beta},\ \sigma^2 \left( X_c' F_c W_c X_c \right)^{-1} \right)$$

$$\sigma^2 \mid Y_c, X_c \sim (N_{obs} - p)\, \hat{\sigma}^2 / \chi^2_{N_{obs} - p}$$


Let $A$ be the upper triangular matrix of the Cholesky decomposition $\left( X_c' F_c W_c X_c \right)^{-1} = A'A$.


Draw parameters from the posterior distributions.

► Draw $(\sigma^*)^2$: draw a random value $u$ from $\chi^2_{N_{obs} - p}$, then $(\sigma^*)^2 = (N_{obs} - p)\, \hat{\sigma}^2 / u$.

► Draw $\beta^*$: draw $p$ independent $N(0,1)$ values to create a random vector $v$, then $\beta^* = \hat{\beta} + \sigma^* A' v$.


Impute missing values. For $i$ in $mis(Y)$, draw $z_i$ from $N(0,1)$; the imputation is $y_i^* = x_i' \beta^* + z_i \sigma^* / \sqrt{w_i}$.

Repeat the drawing of parameters and imputation of missing values to generate multiple
imputations.
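
The fit, draw, and impute steps above can be sketched with NumPy as follows. This is a minimal illustration rather than the procedure's actual implementation: the function name is hypothetical, missing values of $y$ are assumed to be coded as `np.nan`, and the design matrix `X` is assumed to have full column rank (redundant parameters already removed).

```python
import numpy as np

def impute_linear(X, y, f=None, w=None, rng=None):
    """Produce one imputed copy of y (np.nan = missing) from complete predictors X."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    f = np.ones(n) if f is None else np.asarray(f, dtype=float)   # frequency weights
    w = np.ones(n) if w is None else np.asarray(w, dtype=float)   # regression weights
    obs = ~np.isnan(y)

    # Weighted least squares on the complete cases.
    Xc, yc, fw = X[obs], y[obs], (f * w)[obs]
    XtX = Xc.T @ (Xc * fw[:, None])
    beta_hat = np.linalg.solve(XtX, Xc.T @ (fw * yc))
    resid = yc - Xc @ beta_hat
    dof = f[obs].sum() - p                      # N_obs - p
    sigma2_hat = (fw * resid ** 2).sum() / dof

    # Upper triangular A with (Xc' Fc Wc Xc)^{-1} = A'A.
    A = np.linalg.cholesky(np.linalg.inv(XtX)).T

    # Draw sigma* and beta* from their posterior distributions.
    sigma_star = np.sqrt(dof * sigma2_hat / rng.chisquare(dof))
    beta_star = beta_hat + sigma_star * (A.T @ rng.standard_normal(p))

    # Impute: y_i* = x_i' beta* + z_i sigma* / sqrt(w_i).
    mis = ~obs
    y_imp = y.copy()
    z = rng.standard_normal(mis.sum())
    y_imp[mis] = X[mis] @ beta_star + z * sigma_star / np.sqrt(w[mis])
    return y_imp
```

Calling the function $m$ times with the same data gives $m$ imputations, each based on a fresh parameter draw.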


Incorporate restrictions 


Using the linear regression method, a continuous variable may have an imputed value well outside
the range of observed values, so the imputed values of continuous variables can be restricted to
fall within a user-specified range, $R$. When an imputed value falls outside $R$, the algorithm draws
another imputed value until a value is drawn within $R$ or the maximum number of tries allowed for
drawing each missing case under the given parameters has been reached. If that limit
is reached, a new set of parameters is drawn from the posterior distributions (discarding any
successfully imputed values for this variable during this imputation) and the process of imputing
missing values is repeated until a set of imputed values is obtained for this variable and this
imputation, or until the maximum number of tries allowed for drawing parameters has been
reached.


If the limit on parameter draws is reached, the algorithm stops and issues an error.


Predictive Mean Matching 
This is the same as the linear regression method, but with the following changes.


Replace the impute missing values step of linear regression by the following:


Calculate $\hat{Y}^{obs} = \{ \hat{y}_i^{obs} = x_i' \beta^* : i \in obs(y) \}$. For $i$ in $mis(Y)$:

► Calculate $\hat{y}_i^* = x_i' \beta^*$.

► Among $y^{obs}$, find the observation whose corresponding predicted value is closest to $\hat{y}_i^*$.

► Pick that one as the imputation.
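
The matching step can be illustrated with a short sketch. It assumes that a parameter draw `beta_star` is already available (for example from the linear regression steps), that missing values are coded as `np.nan`, and that ties are broken by taking the first closest donor; the tie rule and the function name are assumptions, not something the text specifies.

```python
import numpy as np

def pmm_impute(X, y, beta_star):
    """Predictive mean matching: copy the observed y whose prediction is nearest."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    obs = ~np.isnan(y)
    pred = X @ np.asarray(beta_star, dtype=float)   # predictions for all cases
    pred_obs, y_obs = pred[obs], y[obs]
    y_imp = y.copy()
    for i in np.flatnonzero(~obs):
        donor = np.argmin(np.abs(pred_obs - pred[i]))   # assumption: first closest donor wins
        y_imp[i] = y_obs[donor]
    return y_imp
```

Because every imputed value is copied from an observed case, imputations never fall outside the observed range of $y$.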


Logistic Regression 


The variable to be imputed, $y$, is a categorical variable with $K$ categories taking values 1, 2,
..., $K$, and is used as the dependent variable in the logistic regression model. In the following,
$p_x(k) = \Pr(y = k \mid x)$, and $p_i(k) = p_{x_i}(k)$ for case $i$.


Model: $\log \dfrac{p_x(k)}{p_x(K)} = x' \beta_k$, for $k = 1, \ldots, K-1$.


Prior: $\Pr(\beta) \propto 1$, where $\beta = (\beta_1', \ldots, \beta_{K-1}')'$.

Using the complete cases, fit the logistic regression model with user specified frequency and
analysis/regression weight variables. Denote the fitted parameter vector and its variance matrix as
$\hat{\beta}$, $\hat{V}(\hat{\beta})$. The posterior distribution is approximately $\beta \mid Y_c, X_c \sim N(\hat{\beta}, \hat{V})$. Let $A$ be the upper
triangular matrix of the Cholesky decomposition $\hat{V} = A'A$.


Draw parameters from the posterior distributions: to draw $\beta^*$, draw $\mathrm{length}(\beta)$ independent $N(0,1)$
values to create a random vector $z$, then $\beta^* = \hat{\beta} + A'z$.


Impute missing values. For $i$ in $mis(Y)$:

► Calculate $p_i(k) = \dfrac{\exp(x_i' \beta_k^*)}{1 + \sum_{j=1}^{K-1} \exp(x_i' \beta_j^*)}$ for $k = 1, \ldots, K-1$.

► Draw a random value $u$ from the uniform distribution on [0,1].

► The imputation is

$$y_i^* = \begin{cases} 1 & u < P_i(1) \\ k & P_i(k-1) \le u < P_i(k) \\ K & u \ge P_i(K-1) \end{cases}$$

where $P_i(k) = \sum_{j=1}^{k} p_i(j)$.

Repeat the drawing of parameters and imputation of missing values to generate multiple
imputations.
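
The parameter draw and the category selection rule can be sketched as follows. The function names are hypothetical; `beta_star` is assumed to be stored as a $(K-1) \times p$ array with one row per non-reference category, and fitting the multinomial logistic model itself is outside the scope of the sketch.

```python
import numpy as np

def draw_beta(beta_hat, V, rng):
    """Draw beta* ~ N(beta_hat, V) using the upper triangular A with V = A'A."""
    A = np.linalg.cholesky(V).T
    return beta_hat + A.T @ rng.standard_normal(len(beta_hat))

def impute_category(x_i, beta_star, u):
    """Return the imputed category in 1..K for one case.

    x_i       : predictor vector of length p
    beta_star : (K-1, p) array; row k-1 holds beta*_k, category K is the reference
    u         : a uniform [0, 1] draw
    """
    expo = np.exp(np.asarray(beta_star, dtype=float) @ np.asarray(x_i, dtype=float))
    p = expo / (1.0 + expo.sum())        # p_i(1), ..., p_i(K-1)
    P = np.cumsum(p)                     # P_i(1), ..., P_i(K-1)
    # Count the cumulative probabilities that are <= u; the category is that count plus one.
    return int(np.searchsorted(P, u, side="right")) + 1
```

With all coefficients equal to zero and $K = 3$, each category has probability 1/3, so `u = 0.5` falls in the second interval and category 2 is imputed.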


Multivariate Methods 


Multivariate methods apply to situations in which multiple variables have missing values. Patterns 
of missing values are important here because a fast non-iterative procedure can be used for 
monotone missing patterns. For general missing patterns, the fully conditional specification (FCS) 
is available. This is an iterative MCMC method. 


Monotone Method 


Missing patterns are monotone if the variables can be ordered such that, for each case, all earlier 
variables are observed if the later variable is observed. This method also assumes that the 
parameters of individual imputation models have independent priors. 


Let $Y_1, \ldots, Y_K$ be variables with missing values in the sorted monotone order such that $Y_1$ has the
smallest number of missing values. Let $X$ be the set of variables without missing values. Starting
from $Y_1$, sequentially use a univariate method with the previous $Y$ variables and $X$ variables as
predictors to impute.


► Given $X$, $Y_1^{obs}$ and the imputation model for $Y_1$, impute $Y_1^{mis}$ by a univariate method $m$ times to get $m$
complete variables $Y_1^{*(1)}, \ldots, Y_1^{*(m)}$.

► For $l = 1, \ldots, m$, given $X$, $Y_1^{*(l)}$, $Y_2^{obs}$ and the imputation model for $Y_2$, impute $Y_2^{mis}$ by a univariate
method once to get $Y_2^{*(l)}$.

► For $l = 1, \ldots, m$, given $X$, $Y_1^{*(l)}$, $Y_2^{*(l)}$, $Y_3^{obs}$ and the imputation model for $Y_3$, impute $Y_3^{mis}$ by a
univariate method once to get $Y_3^{*(l)}$.

► Continue until the last variable $Y_K^{mis}$ is imputed.


Notes:

■ The imputation model for variable $Y_j$ can only use variables from $X, Y_1, \ldots, Y_{j-1}$ as predictors.
In the case of no $X$ variables, a constant model for $Y_1$ is used.

■ The posterior distribution used to draw parameters for imputing $Y_j^{mis}$ doesn’t depend on
previously imputed values $\{ Y_1^{*(l)}, \ldots, Y_{j-1}^{*(l)} \}$.
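
Whether a missing-data pattern is monotone can be checked mechanically: if a monotone ordering exists, the sets of missing cases are nested, so sorting the variables by their number of missing values recovers it. A minimal sketch, assuming missing values are coded as `np.nan` and using a hypothetical function name:

```python
import numpy as np

def monotone_order(data):
    """Return a column order that makes the missing pattern monotone, or None."""
    miss = np.isnan(np.asarray(data, dtype=float))
    # Fewest missing values first; a stable sort keeps ties in their original order.
    order = np.argsort(miss.sum(axis=0), kind="stable")
    m = miss[:, order]
    # Monotone: once a variable is missing for a case, every later variable is missing too.
    is_monotone = bool(np.all(m[:, 1:] >= m[:, :-1]))
    return order.tolist() if is_monotone else None
```

A return value of `None` indicates a general pattern, for which the fully conditional specification method is the applicable alternative.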


Fully Conditional Specification (FCS) 


In this method, an imputation model for each variable with missing values is specified. This 
method is an iterative MCMC procedure. In each iteration, it sequentially imputes missing values 
starting from the first variable with missing values. 


► Set initial values for missing values in all variables, $\tilde{Y}_1^{(0)}, \ldots, \tilde{Y}_K^{(0)}$ (see below).

► At iteration $t$, for $j = 1$ to $K$: Given $X, \tilde{Y}_1^{(t)}, \ldots, \tilde{Y}_{j-1}^{(t)}, \tilde{Y}_{j+1}^{(t-1)}, \ldots, \tilde{Y}_K^{(t-1)}$; that is, the most recently
imputed values of all other variables ($X, \tilde{Y}_2^{(t-1)}, \ldots, \tilde{Y}_K^{(t-1)}$ for $j = 1$, and $X, \tilde{Y}_1^{(t)}, \ldots, \tilde{Y}_{K-1}^{(t)}$ for $j = K$),
use a univariate method to impute all missing values in the $j$th variable, $\tilde{Y}_j^{(t)}$.

► Continue iterations until the maximum number of iterations is reached.


We create multiple imputations by the multiple chain method; that is, we repeat the above steps $m$
times to get $m$ imputations. Each chain starts with a different seed for random numbers and
different initial values.


Initial Values 


For a continuous variable with missing values, use the non-missing values to find its sample 
mean and standard deviation, then fill in the missing values with random draws from a normal 
distribution with mean and standard deviation equal to the sample values, limited within the range 
of the observed minimum and maximum values. 


For a categorical variable with missing values, use the non-missing values to find the observed 
proportion of each category, then fill in the missing values with random draws from a multinomial 
distribution with category probabilities equal to the observed category proportions. 
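
The two initialization rules can be sketched as below. The function names are hypothetical, and two details are assumptions made for the example: "limited within the range" is implemented by clipping draws to the observed minimum and maximum, and missing categorical values are coded as -1.

```python
import numpy as np

def init_continuous(y, rng):
    """Fill np.nan entries with normal draws based on the observed mean and SD."""
    y = np.asarray(y, dtype=float)
    obs = ~np.isnan(y)
    mean, sd = y[obs].mean(), y[obs].std(ddof=1)
    draws = rng.normal(mean, sd, size=(~obs).sum())
    out = y.copy()
    # Assumption: the range restriction is applied by clipping to [min, max] of observed values.
    out[~obs] = np.clip(draws, y[obs].min(), y[obs].max())
    return out

def init_categorical(y, rng):
    """Fill entries coded -1 with draws from the observed category proportions."""
    y = np.asarray(y)
    obs = y != -1
    cats, counts = np.unique(y[obs], return_counts=True)
    out = y.copy()
    out[~obs] = rng.choice(cats, size=(~obs).sum(), p=counts / counts.sum())
    return out
```

Starting each chain with a different random generator seed yields the different initial values that the multiple chain method calls for.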


Assessment of Convergence 


For each imputation and each iteration, missing values are imputed for each variable. Let
$\tilde{Y}_j^{mis(l,t)}$ be the vector of imputed values of $Y_j^{mis}$ at iteration $t$, imputation $l$. For each $(l, t)$,
calculate the sample mean and standard deviation of $\tilde{Y}_j^{mis(l,t)}$:


$$m_j^{(l,t)} = \mathrm{mean}\left( \tilde{Y}_j^{mis(l,t)} \right)$$

$$s_j^{(l,t)} = \mathrm{sd}\left( \tilde{Y}_j^{mis(l,t)} \right)$$


Sequence plots of $m_j^{(l,t)}$ versus $t$ and $s_j^{(l,t)}$ versus $t$ are useful in assessing convergence. If there are
5 imputations, then there will be 5 lines (in different colors) in the same plot. On convergence, for
each variable $j$, the traces of different $l$ should be intermingled with each other without showing
any definite trends, and the variance between different sequences is no larger than the variance
within each individual sequence. When frequency and analysis weights are involved, the mean and
standard deviation are calculated using the weights as well.


Automatic Selection of Imputation Method 


If automatic selection of the imputation method is selected, the method is chosen as follows:

► If the pattern of missing values is monotone, then the monotone method is used.

► Otherwise, the fully conditional specification (FCS) method is used.


Note: only main effects models are used during automatic selection.


Special Situations


When the variable to be imputed is constant over all its observed values, we use this constant to 
impute its missing values. 


Missing Values 


The following cases are not used during imputation.

■ Cases with every variable missing.

■ Cases with zero/negative replication or analysis weight.


Multiple Imputation: Pooling Algorithms


Analysis of missing values consists of two sequential steps: analysis of each individual complete
data set to create multiple analysis results and then combining (pooling) these multiple analysis
results. This algorithm only considers combining the multiple analysis results, assuming that
multiple complete datasets are created and the analysis of each individual complete dataset
is complete. See “Multiple Imputation Algorithms” for the algorithm for creating multiply
imputed data sets. See the algorithm of the analysis you’re performing for details on the
analysis of an individual complete data set.


Notation 
The following notation is used throughout this chapter unless otherwise stated:

$m$                           The number of multiply imputed sets of complete data. $m \ge 2$ is assumed.
$Q = (Q_1, \ldots, Q_k)'$     Parameter vector (with $k$ elements) to be estimated.
$\hat{Q}^{(i)}$               Estimated parameter vector using the $i$-th set of completed data, $i = 1, \ldots, m$.
$U^{(i)}$                     Estimated covariance matrix of $\hat{Q}^{(i)}$.
$\bar{Q}$                     The final combined estimate of $Q$.


Rubin’s Rules 


Across all the complete datasets, it is assumed that:

■ the model of the same effects in the same order is fit,

■ a categorical variable has the same set of categories and the reference category is the same.


Assuming that each individual analysis result $\{ \hat{Q}^{(i)}, U^{(i)} \}_{i=1}^{m}$ is available, the goal is to derive
the final combined result based on these $m$ individual results.


Combining Results after Multiple Imputation


The final estimate of $Q$ is simply the average of individual ones:

$$\bar{Q} = \frac{1}{m} \sum_{i=1}^{m} \hat{Q}^{(i)}$$


The estimated total variance is


$$T = \bar{U} + \left( 1 + \frac{1}{m} \right) B$$

where $B$ and $\bar{U}$ are respectively the between-imputation and average within-imputation variance,
calculated by

$$B = \frac{1}{m-1} \sum_{i=1}^{m} \left( \hat{Q}^{(i)} - \bar{Q} \right) \left( \hat{Q}^{(i)} - \bar{Q} \right)'$$

$$\bar{U} = \frac{1}{m} \sum_{i=1}^{m} U^{(i)}$$


Special Situations 


Redundant parameters. Standard procedures set redundant parameter estimates at 0 and 
variance/covariance as missing. If a parameter is redundant across all imputations, then the 
combined parameter is still redundant. If there is a parameter that is redundant in some imputations 
but not in others, this causes an error. The reason is that the combined results depend on the 
order of effects in the model (for example x1,x2,x3 or x3,x1,x2 when x3=x1+x2 holds in some 
imputations but not in others) which makes the combined results arbitrary and useless. 


Different sets of parameters. There may be situations in which some model coefficients occur 
in some model fits but not in others (for example, a certain combination of two categorical 
variables occurs in some complete datasets but not in others). If the parameters across 
imputations are different, this causes an error. The reason is that the combined results depend on 
the choice of reference categories of categorical predictors which makes the results arbitrary and 
useless. 


Missing Elements. If there are any missing elements in $\{ \hat{Q}^{(i)}, U^{(i)} \}_{i=1}^{m}$, then we will only use
the non-missing part to do calculations.


Scalar Q 


If Q is a scalar (k=1), then 
(Q-@Q)/vT 
has an approximate Student’s t distribution with degrees of freedom 


Um = (m—1) [1 | rt)? 


where r is the relative increase in variance due to non-response 


(1 ++ m') B 
U 


The fraction of missing information about Q due to missing values is 


— 


r+2/(Vvm +3) 
r+1 


v= 
The relative efficiency (RE) of using the finite $m$ imputation estimator, rather than an infinite number of imputations for the fully efficient estimator, in units of variance, is approximately

\[ RE = \left(1 + \hat{\lambda}/m\right)^{-1} \]
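The scalar pooling formulas can be sketched in a few lines of Python. This is a minimal illustration and not the procedure's actual implementation: the function name `pool_scalar`, the dictionary keys, and the handling of the zero between-imputation variance case are assumptions made for the example.

```python
import math

def pool_scalar(estimates, variances):
    """Pool m completed-data estimates of a scalar Q (hypothetical helper)."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need m >= 2 estimates with matching variances")
    q_bar = sum(estimates) / m                              # combined estimate
    u_bar = sum(variances) / m                              # average within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # between-imputation variance
    t = u_bar + (1 + 1 / m) * b                             # total variance
    if b == 0:
        # Assumption: with no between-imputation variability, r = 0 and the
        # degrees of freedom are treated as infinite.
        return {"Q": q_bar, "U": u_bar, "B": b, "T": t,
                "r": 0.0, "df": math.inf, "fmi": 0.0, "RE": 1.0}
    r = (1 + 1 / m) * b / u_bar                             # relative increase in variance
    df = (m - 1) * (1 + 1 / r) ** 2                         # degrees of freedom
    fmi = (r + 2 / (df + 3)) / (r + 1)                      # fraction of missing information
    re = 1 / (1 + fmi / m)                                  # relative efficiency
    return {"Q": q_bar, "U": u_bar, "B": b, "T": t,
            "r": r, "df": df, "fmi": fmi, "RE": re}

if __name__ == "__main__":
    print(pool_scalar([1.0, 2.0, 3.0], [0.5, 0.5, 0.5]))
```

The standard error is the square root of the pooled total variance; confidence limits additionally need a t quantile, which is left out here to keep the sketch free of third-party dependencies.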


Vector Q 


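If the number of imputations $m$ is big enough (at least 50,000), then

\[ (Q - \bar{Q})' \, T^{-1} (Q - \bar{Q}) / k \]

has an approximate F distribution with $k$ numerator degrees of freedom and denominator degrees of freedom

\[ \nu = (m - 1) \left[1 + r^{-1}\right]^2 \]

where

\[ r = \frac{1}{k} \left(1 + m^{-1}\right) \mathrm{trace}\left(B \bar{U}^{-1}\right) \]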


For small $m$ (the usual case in practice), this approximation is poor because the estimate of $B$ is unstable, and when $m < k$, $B$ is not even of full rank. Alternatively, we assume that $B$ and $\bar{U}$ are proportional to one another. Under this assumption, a more stable estimate of the total variance is

\[ \tilde{T} = (1 + r) \bar{U} \]

and

\[ \frac{1}{k} (Q - \bar{Q})' \, \tilde{T}^{-1} (Q - \bar{Q}) \sim F_{k, \tilde{\nu}} \]

has an approximate F distribution with $k$ numerator degrees of freedom and denominator degrees of freedom $\tilde{\nu}$ (Li, Raghunathan and Rubin (1991)). Let $t = k(m - 1)$; then

\[ \tilde{\nu} = \begin{cases} t \left(1 + k^{-1}\right) \left(1 + r^{-1}\right)^2 / 2 & t \le 4 \\ 4 + (t - 4) \left[1 + \left(1 - 2t^{-1}\right) r^{-1}\right]^2 & t > 4 \end{cases} \]

Note:
■ When $k = 1$, $\tilde{T}$ reduces to $T$ for a scalar statistic.

■ When $k = 1$, $\tilde{\nu} = \nu_m$ if $m \le 5$, and $\tilde{\nu} < \nu_m$ if $m > 5$.
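The two-branch denominator degrees of freedom can be checked numerically. The sketch below assumes the branch boundary is at t = 4 (inclusive in the first branch), which is what makes the k = 1, m = 5 case agree with the scalar formula; the function names are hypothetical.

```python
def lrr_denominator_df(k, m, r):
    """Denominator df for the pooled F test, with t = k(m - 1)."""
    t = k * (m - 1)
    if t <= 4:
        return t * (1 + 1 / k) * (1 + 1 / r) ** 2 / 2
    return 4 + (t - 4) * (1 + (1 - 2 / t) / r) ** 2

def scalar_df(m, r):
    """Degrees of freedom used for a scalar statistic."""
    return (m - 1) * (1 + 1 / r) ** 2

if __name__ == "__main__":
    for m in (3, 5, 10):
        print(m, lrr_denominator_df(1, m, 0.5), scalar_df(m, 0.5))
```

For k = 1 the two functions coincide up to m = 5 and the F-test value falls below the scalar value for larger m, which is the behavior stated in the note.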
Output Statistics


Other than $\bar{Q}$, $\bar{U}$, $B$, $T$, we are also interested in some statistics for each individual element of $Q$ (for example, the vector of regression coefficients). For the jth element of $Q$, we calculate the following. Note that the quantities listed below do not use the off-diagonal elements of the matrices $T$, $B$, or $\bar{U}$. They are the same as treating each element as a scalar and calculating them separately. In the following, $t_\nu$ denotes a random variable following a Student's t distribution with degrees of freedom $\nu$, and $t_{\nu, 1-\alpha/2}$ denotes the $100(1 - \alpha/2)$ percentile of the distribution such that $\Pr[t_\nu < t_{\nu, 1-\alpha/2}] = 1 - \alpha/2$.


Estimate: $\bar{Q}_j$

Standard error: $se_j = \sqrt{T_{jj}}$

Degrees of freedom: $\nu_j$

Confidence interval: $\bar{Q}_j \pm t_{\nu_j, 1-\alpha/2} \, se_j$

t-value: $t_j = \bar{Q}_j / se_j$

p-value for the hypothesis test $H_0: Q_j = 0$: $p = 2 \Pr\left[t_{\nu_j} > |t_j|\right]$

Relative increase in variance due to non-response: $r_j = \left(1 + m^{-1}\right) B_{jj} / \bar{U}_{jj}$

Fraction of missing information: $\hat{\lambda}_j$

Relative efficiency (RE): $RE_j = \left(1 + \hat{\lambda}_j / m\right)^{-1}$


Hypothesis Tests 


The p-value for testing $H_0: Q = Q_0$ is

\[ p = \Pr\left(F_{k, \tilde{\nu}} > F\right) \]

where

\[ F = \frac{1}{k} \left(\bar{Q} - Q_0\right)' \, \tilde{T}^{-1} \left(\bar{Q} - Q_0\right) \]

\[ k = \mathrm{rank}(\tilde{T}) \]

We will also apply this test to scalar statistics. Note that for $k = 1$ this test does not necessarily reduce to the equivalent Student's t test mentioned in the scalar Q section, because the degrees of freedom may differ.
General Linear Contrast of Model Parameters


The above test can be applied to test hypotheses about linear combinations of parameters. For a given matrix $L$ and a vector $K$, $H_0: L\beta = K$ can be tested, where $\beta$ is a model parameter vector (regression coefficients, for example). Let $Q = L\beta$, $\hat{Q}_i = L\hat{\beta}_i$, and $\hat{U}_i = L \, \mathrm{Var}(\hat{\beta}_i) \, L'$. This test becomes testing $H_0: Q = K$.


It is likely that only $K$, $\hat{Q}_i$ and the diagonal elements of $\hat{U}_i$ are available, so the simultaneous test of $H_0$ cannot be done. Instead, we test each row of $H_0$ separately. Denote the l-th row hypothesis of $H_0$ as $H_{0l}: Q_l = k_l$. Let $p_l$ be the p-value for testing $H_{0l}$ alone. If multiple comparisons are requested, the p-values are adjusted as usual.


In multivariate GLM, there is a parameter matrix, $B$, instead of a vector. In multivariate GLM, $H_0: LBM = K$ can be tested for the given matrices $L$, $M$, $K$. Where possible, we separately test each element of the hypotheses $H_{0ij}: Q_{ij} = l_i' B m_j = k_{ij}$, where $l_i'$ is the i-th row vector of matrix $L$, $m_j$ is the j-th column vector of matrix $M$, and $k_{ij}$ is the ijth element of $K$. Again, if multiple comparisons are requested, the p-values are adjusted as usual.
MVA Algorithms


The Missing Value procedure provides descriptions of missing value patterns; estimates of means, 
standard deviations, covariances, and correlations (using a listwise, pairwise, EM, or regression 
method); and imputation of values by either EM or regression. 


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


$X$                       Data matrix
$x_{ij}$                  Value of the ith case, jth variable
$v$                       Number of variables
$n$                       Number of cases
$n_i$                     Number of nonmissing values of the ith variable
$n_{ij}$                  Number of nonmissing value pairs of the ith and jth variables
$n_c$                     Number of complete cases
$J$                       Index of all variables
$J_1 = J$(condition)      Index of variables satisfying "condition"
$I$                       Index of all cases
$I(k_1, \ldots, k_l)$     Index of cases at which variables $k_1, \ldots, k_l$ are not missing
$I(J)$                    Index of complete cases
$a = [a_i]$               Vector whose ith element is $a_i$
$A = [a_{ij}]$            Matrix whose ith row, jth column element is $a_{ij}$


Example to Illustrate Notation 


43 76 34 
¢, 45-72 
44. 15 82 
—_ aa 
sateea li 99 
-, 12 
54 67 
43°" 34 
$x_{2,3} = 72$                        The 2nd row, 3rd element
$v = 3$                               Number of variables
$n = 7$                               Number of cases
$n_2 = 5$                             Number of nonmissing values in the 2nd variable
$n_{23} = 4$                          Number of nonmissing value pairs in the 2nd and 3rd variables
$n_c = 3$                             Number of complete cases
$J = \{1,2,3\}$                       Index of variables
$J$(2 or more missing) $= \{1,2\}$    The 1st and 2nd variables have two or more missing values
$I = \{1,2,3,4,5,6,7\}$               Index of cases
$I(2) = \{1,2,3,6,7\}$                Index of cases at which the 2nd variable is not missing
$I(2,3) = \{1,2,3,7\}$                Index of cases at which the 2nd and 3rd variables are not missing
$I(J) = \{1,3,7\}$                    Index of complete cases
$\bar{x}_2 = 43.0$                    The 2nd element of the vector $\bar{x} = [\bar{x}_1, \bar{x}_2, \bar{x}_3]$


Univariate Statistics 
The index j refers to quantitative variables. 
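Mean

\[ \bar{x} = [\bar{x}_j] = \left[ \sum_i x_{ij} / n_j ;\ i \in I(j) \right] \]

Standard Deviation

\[ \hat{\sigma} = [\hat{\sigma}_j] = \left[ \left( \sum_i (x_{ij} - \bar{x}_j)^2 / (n_j - 1) \right)^{1/2} ;\ i \in I(j) \right] \]

Extreme Low

\[ NL = [nl_j] = \left[ \text{number of } x_{ij} \text{ values} < \text{low\_limit}_j \right] \]

Extreme High

\[ NH = [nh_j] = \left[ \text{number of } x_{ij} \text{ values} > \text{high\_limit}_j \right] \]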
where 


(— 2% 65 if v * n * logy9(n) 


a | Ly : > 
UES = { 25th percentile of the jth varible ifv *n*logy)(n) < 15 


and 


Pech Nie Tj +2* 0; if v en * logyo(n) 
gue J ~~ | 75th percentile of the jth variable if v * n * logy,(n) 


Separate Variance T Test 


The index k refers to quantitative variables, and index j refers to all variables. 
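\[ \bar{x}^a_{jk} = \left[ \bar{x}_k \mid \text{variable } j \text{ is missing} \right] \]

\[ \hat{\sigma}^a_{jk} = \left[ \hat{\sigma}_k \mid \text{variable } j \text{ is missing} \right] \]

\[ t_{jk} = \frac{\bar{x}^p_{jk} - \bar{x}^a_{jk}}{\sqrt{\dfrac{(\hat{\sigma}^p_{jk})^2}{n_{jk}} + \dfrac{(\hat{\sigma}^a_{jk})^2}{n_{kk} - n_{jk}}}} \]

where $\bar{x}^p_{jk}$ and $\hat{\sigma}^p_{jk}$ are defined in “Pairwise Statistics”.

\[ df_{jk} = \frac{\left( \dfrac{(\hat{\sigma}^p_{jk})^2}{n_{jk}} + \dfrac{(\hat{\sigma}^a_{jk})^2}{n_{kk} - n_{jk}} \right)^2}{\dfrac{1}{n_{jk} - 1} \left( \dfrac{(\hat{\sigma}^p_{jk})^2}{n_{jk}} \right)^2 + \dfrac{1}{n_{kk} - n_{jk} - 1} \left( \dfrac{(\hat{\sigma}^a_{jk})^2}{n_{kk} - n_{jk}} \right)^2} \]

\[ p(\text{2-tail})_{jk} = 1 - 2 \left| 0.5 - \mathrm{tcdf}(t_{jk}, df_{jk}) \right| \]

where “tcdf” is the t cumulative distribution function.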
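As an illustration, the separate-variance (Welch) t statistic and its approximate degrees of freedom can be computed as follows. This is a sketch under the assumption that the two groups are the values of a quantitative variable in cases where an indicator variable is present versus missing; the function name and sample data are hypothetical, and the p-value is omitted because it needs a t distribution function from a statistics library.

```python
import math

def separate_variance_t(present, absent):
    """Welch t and Satterthwaite df for two independent groups of values."""
    n_p, n_a = len(present), len(absent)
    if n_p < 2 or n_a < 2:
        raise ValueError("each group needs at least two values")
    mean_p = sum(present) / n_p
    mean_a = sum(absent) / n_a
    var_p = sum((x - mean_p) ** 2 for x in present) / (n_p - 1)
    var_a = sum((x - mean_a) ** 2 for x in absent) / (n_a - 1)
    se2_p = var_p / n_p            # squared standard error, "present" group
    se2_a = var_a / n_a            # squared standard error, "missing" group
    t = (mean_p - mean_a) / math.sqrt(se2_p + se2_a)
    df = (se2_p + se2_a) ** 2 / (se2_p ** 2 / (n_p - 1) + se2_a ** 2 / (n_a - 1))
    return t, df

if __name__ == "__main__":
    print(separate_variance_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))
```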


Listwise Statistics 


The indices j and k refer to quantitative variables. 


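Mean

\[ \bar{x}^l = [\bar{x}^l_j] = \left[ \sum_i x_{ij} / n_c ;\ i \in I(J) \right] \]

Covariance

\[ C^l = [c^l_{jk}] = \left[ \sum_i (x_{ij} - \bar{x}^l_j)(x_{ik} - \bar{x}^l_k) / (n_c - 1) ;\ i \in I(J) \right] \]

Correlation

\[ R^l = [r^l_{jk}] = \left[ c^l_{jk} / \left( c^l_{jj} \, c^l_{kk} \right)^{1/2} \right] \]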


Pairwise Statistics 


The indices j and k refer to quantitative variables, and ! refers to all variables. 


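Mean

\[ \bar{X}^p = [\bar{x}^p_{lk}] = \left[ \sum_i x_{ik} / n_{lk} ;\ i \in I(l,k) \right] \]

Standard Deviation

\[ \hat{\sigma}^p = [\hat{\sigma}^p_{lk}] = \left[ \left( \sum_i (x_{ik} - \bar{x}^p_{lk})^2 / (n_{lk} - 1) \right)^{1/2} ;\ i \in I(l,k) \right] \]

Covariance

\[ C^p = [c^p_{jk}] = \left[ \sum_i (x_{ik} - \bar{x}^p_{jk})(x_{ij} - \bar{x}^p_{kj}) / (n_{jk} - 1) ;\ i \in I(j,k) \right] \]

Correlation

\[ R^p = [r^p_{jk}] = \left[ c^p_{jk} / \left( \hat{\sigma}^p_{kj} \, \hat{\sigma}^p_{jk} \right) \right] \]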
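The difference between listwise and pairwise statistics is only in which cases contribute. A minimal sketch of the pairwise case, assuming missing values are coded as `None` and using a hypothetical helper name and toy data, restricts every quantity for a pair of variables to the cases where both are present:

```python
import math

def pairwise_stats(data, j, k):
    """Covariance and correlation of variables j and k over cases where both are present."""
    pairs = [(row[j], row[k]) for row in data
             if row[j] is not None and row[k] is not None]
    n_jk = len(pairs)
    if n_jk < 2:
        return n_jk, None, None
    mean_j = sum(a for a, _ in pairs) / n_jk
    mean_k = sum(b for _, b in pairs) / n_jk
    cov = sum((a - mean_j) * (b - mean_k) for a, b in pairs) / (n_jk - 1)
    sd_j = math.sqrt(sum((a - mean_j) ** 2 for a, _ in pairs) / (n_jk - 1))
    sd_k = math.sqrt(sum((b - mean_k) ** 2 for _, b in pairs) / (n_jk - 1))
    corr = cov / (sd_j * sd_k) if sd_j > 0 and sd_k > 0 else None
    return n_jk, cov, corr

if __name__ == "__main__":
    toy = [[1, 2], [2, 4], [3, None], [None, 5], [4, 8]]
    print(pairwise_stats(toy, 0, 1))
```

Because each pair of variables can use a different set of cases, a matrix assembled this way is not guaranteed to be positive semidefinite, which is a known trade-off of pairwise deletion.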


Regression Estimated Statistics 


The indices j and k refer to quantitative variables, and / refers to predictor variables. 


Estimates of Missing Values

\[ x^R_{ij} = \begin{cases} x_{ij} & \text{if } x_{ij} \text{ is not missing} \\ \text{regression estimated } x_{ij} & \text{if } x_{ij} \text{ is missing} \end{cases} \]

where the regression estimated $x_{ij}$ is

\[ \hat{x}_{ij} = \hat{\beta}_{0,ij} + \sum_l \hat{\beta}_{l,ij} \, x_{il} + e_{ij} , \quad l \in J_1 = J(l : x_{il} \text{ not missing and } l \ne j) \]

where:

$[\hat{\beta}_{0,ij}, \hat{\beta}_{l,ij}]$ is computed from $\mathrm{Diag}(\bar{X}^p) = [\bar{x}^p_{jj}]$ and by pivoting on the “best” “q” of the $J_1$ diagonals of $C^p$.

“best” is forward stepwise selected.

“q” is less than or equal to the user-specified maximum number of predictors; it may also be limited by the user-specified F-to-enter limit.

“$e_{ij}$” is the optional random error term, as specified:

1. residual of a randomly selected complete case
2. random normal deviate, scaled by the standard error of estimate
3. random t(df) deviate, scaled by the standard error of estimate; df is specified by the user
4. no error term adjustment
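Note that for each missing value $x_{ij}$, a unique set of regression coefficients $(\hat{\beta}_{0,ij}, \hat{\beta}_{l,ij})$ and error terms $e_{ij}$ is computed.


Mean

\[ \bar{x}^R = [\bar{x}^R_j] = \left[ \sum_i x^R_{ij} / n ;\ i \in I \right] \]

Covariance

\[ C^R = [c^R_{jk}] = \left[ \sum_i (x^R_{ij} - \bar{x}^R_j)(x^R_{ik} - \bar{x}^R_k) / (n - 1) ;\ i \in I \right] \]

Correlation

\[ R^R = [r^R_{jk}] = \left[ c^R_{jk} / \left( c^R_{jj} \, c^R_{kk} \right)^{1/2} \right] \]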


EM Estimated Statistics 


The indices j and k refer to quantitative variables, and / refers to predictor variables. 


Estimates of Missing Values, Mean Vector, and Covariance Matrix 


2. | 40 | = AP — |,P 
a f | = f | 
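For $m = 1$ to $M$, or until convergence is attained:

If $x_{ij}$ is not missing then $x^m_{ij} = x_{ij}$.

If $x_{ij}$ is missing then it is estimated in the mth iteration as:

\[ x^m_{ij} = \hat{\beta}_{0,ij} + \sum_l \hat{\beta}_{l,ij} \, x_{il} , \quad l \in J_2 = J(l : x_{il} \text{ is not missing and } l \ne j) \]

where $[\hat{\beta}_{0,ij}, \hat{\beta}_{l,ij}]$ is computed from $\bar{x}_{m-1}$ and $C_{m-1}$.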


— oe | fd eee (DD ee ee TL ID ayes = 
cn = | ] = [Daw « vi; /Uiwi; 1E I] 


; _ tt fm —m m —m im-—-l| 
iw; ee (ae =e ) * (Cp — Ly, ) T Digdsat lo 
m J J J 2 js|J2 4 * 
Cm = [cr] = 525 Jo, 8 val Jo, ands# J 
J - Sapient ] 
(n — 1) * Y;w;/n 


where c als a is the jth row, sth element of the J pivoted C,,_;. 
Note that some sources (Little & Rubin, 1987, for example) simply use $n$ as the denominator of the formula for $C_m$, which produces full maximum likelihood (ML) estimates. The formula used by MVA produces restricted maximum likelihood (REML) estimates, which are $n/(n-1)$ times the ML estimates.

\[ w_i = \begin{cases} 1 & \text{for multivariate normal} \\ \dfrac{1 - \alpha + \alpha \, \lambda^{p/2+1} \exp\left( (1 - \lambda) D^2 / 2 \right)}{1 - \alpha + \alpha \, \lambda^{p/2} \exp\left( (1 - \lambda) D^2 / 2 \right)} & \text{for contaminated normal} \\ (df + p) / (df + D^2) & \text{for } t(df) \end{cases} \]

$\alpha$ = proportion of contamination
$\lambda$ = ratio of standard deviations
$p$ = number of predictors = number of indices in $J_2$
$D^2$ = squared Mahalanobis distance of the current case from the mean

\[ D^2 = \sum_{jk} (x^m_{ij} - \bar{x}^m_j) \, (c^m_{jk})^{-1} \, (x^m_{ik} - \bar{x}^m_k) \]

where $(c^m_{jk})^{-1}$ is the jkth element of $C_m^{-1}$.


Convergence 


The algorithm is declared to have converged if, for all j, 


JJ JJ 


cm — mal | Jc. < CONVERGENCE 


Filled-In Data 


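Mean

\[ \bar{x}^E = [\bar{x}^E_j] = \bar{x}_m = [\bar{x}^m_j] \]

Covariance

\[ C^E = [c^E_{jk}] = C_m = [c^m_{jk}] \]

Correlation

\[ R^E = [r^E_{jk}] = \left[ c^E_{jk} / \left( c^E_{jj} \, c^E_{kk} \right)^{1/2} \right] \]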


Little’s MCAR Test


\[ \chi^2_{MCAR} = \sum_{\text{each unique pattern}} (\text{no. of cases in pattern}) \times MD \]

\[ DF_{MCAR} = \sum_{\text{each unique pattern}} (\text{no. of nonmissing variables}) - v \]

where

MD = Mahalanobis $D^2$ of the pattern mean from $\bar{x}^E$
NAIVE BAYES Algorithms


The Naive Bayes model is an old method for classification and predictor selection that is enjoying a renaissance because of its simplicity and stability.


Notation

The following notation is used throughout this chapter unless otherwise stated:

Table 65-1
Notation

Notation      Description
$J_0$         Total number of predictors.
$X$           Categorical predictor vector $X' = (X_1, \ldots, X_J)$, where $J$ is the number of predictors considered.
$M_j$         Number of categories for predictor $X_j$.
$Y$           Categorical target variable.
$K$           Number of categories of $Y$.
$N$           Total number of cases or patterns in the training data.
$N_k$         The number of cases with $Y = k$ in the training data.
$N_{jmk}$     The number of cases with $Y = k$ and $X_j = m$ in the training data.
$\pi_k$       The probability for $Y = k$.
$p_{jmk}$     The probability of $X_j = m$ given $Y = k$.


Naive Bayes Model 


The Naive Bayes model is based on the conditional independence model of each predictor given 
the target class. The Bayesian principle is to assign a case to the class that has the largest posterior 
probability. By Bayes’ theorem, the posterior probability of Y given X is: 


\[ P(Y = k \mid X = x) = \frac{P(X = x \mid Y = k) \, P(Y = k)}{\sum_{i=1}^{K} P(X = x \mid Y = i) \, P(Y = i)} \]

Let $X_1, \ldots, X_J$ be the $J$ predictors considered in the model. The Naive Bayes model assumes that $X_1, \ldots, X_J$ are conditionally independent given the target; that is:

\[ P(X = x \mid Y = k) = \prod_{j=1}^{J} P(X_j = x_j \mid Y = k) \]

These probabilities are estimated from training data by the following equations:

\[ \hat{\pi}_k = \hat{P}(Y = k) = \frac{N_k + \lambda}{N + K\lambda} \]

\[ \hat{p}_{jmk} = \hat{P}(X_j = m \mid Y = k) = \frac{N_{jmk} + f}{N_k + M_j f} \]

where $N_k$ is calculated based on all non-missing $Y$, $N_{jmk}$ is based on all non-missing pairs of $X_j$ and $Y$, and the factors $\lambda$ and $f$ are introduced to overcome problems caused by zero or very small cell counts. These estimates correspond to Bayesian estimation of the multinomial
probabilities with Dirichlet priors. Empirical studies suggest $\lambda = f = 1/N$ (Kohavi, Becker, and Sommerfield, 1997).


A single data pass is needed to collect all the involved counts. 


For the special situation in which J = 0; that is, there is no predictor at all, 
P(Y =k|X =x) = P(Y =k). When there are empty categories in the target variable or 
categorical predictors, these empty categories should be removed from the calculations. 
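A small sketch may help make the smoothed estimates and the posterior concrete. It assumes complete training data, treats the smoothing factors as plain function arguments, and uses hypothetical names (`fit_naive_bayes`, `posterior`) and toy data; it is not the procedure's implementation.

```python
from collections import Counter

def fit_naive_bayes(X, y, lam, f):
    """Estimate smoothed priors and conditional probabilities.

    X: list of cases, each a list of categorical predictor values.
    y: list of target categories. Assumes no missing values in X or y.
    """
    n = len(y)
    class_counts = Counter(y)                       # N_k
    classes = sorted(class_counts)
    K = len(classes)
    prior = {k: (class_counts[k] + lam) / (n + K * lam) for k in classes}
    cond = []                                       # cond[j][(m, k)] = p_jmk
    for j in range(len(X[0])):
        cats = sorted({row[j] for row in X})        # observed categories of X_j
        cell = Counter((row[j], yk) for row, yk in zip(X, y))   # N_jmk
        cond.append({(m, k): (cell[(m, k)] + f) / (class_counts[k] + len(cats) * f)
                     for m in cats for k in classes})
    return prior, cond

def posterior(prior, cond, x):
    """Posterior P(Y = k | X = x); missing (None) or unseen values are skipped."""
    scores = {}
    for k, pk in prior.items():
        score = pk
        for j, value in enumerate(x):
            p = cond[j].get((value, k))
            if p is not None:
                score *= p
        scores[k] = score
    total = sum(scores.values())
    return {k: s / total for k, s in scores.items()}

if __name__ == "__main__":
    X = [["a"], ["a"], ["b"], ["b"]]
    y = ["p", "p", "p", "q"]
    prior, cond = fit_naive_bayes(X, y, lam=1 / len(y), f=1 / len(y))
    print(prior)
    print(posterior(prior, cond, ["b"]))
```

All counts come from a single pass over the data, and a case with no usable predictor values simply falls back to the prior.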


Preprocessing 


The following steps are performed before building the Naive Bayes model. 


Missing Values 


A predictor is ignored if every value is missing or if it has only one observed category. A case is 
ignored if the value of the target variable or the values of all predictors are missing. For each case 
missing some, but not all, of the values of the predictors, only the predictors with nonmissing 
values are used to predict the case, as suggested in (Kohavi et al., 1997). 


This implies the following equation: 


\[ P(X = x_i \mid Y = k) = \prod_{\{j :\, x_{ij} \text{ not missing}\}} P(X_j = x_{ij} \mid Y = k) \]

This also implies the following equation for $B(J)$ in average log-likelihood calculations:

\[ B(J) = -\frac{1}{N} \sum_{i=1}^{N} \log \sum_{k=1}^{K} \hat{\pi}_k \prod_{\{j :\, x_{ij} \text{ not missing}\}} \hat{p}_{j x_{ij} k} \]

where the log() term for case $i$ is ignored if all the values of the predictors considered in the model are missing. For more information, see the topic “Average Log-likelihood”.


Continuous Variables 


The Naive Bayes model assumes that the target and predictor variables are categorical. If there are 
continuous variables, they need to be discretized. There are many ways to discretize a continuous 
variable; the simplest is to divide the domain of a variable into equal width bins. This method 
performs well with the Naive Bayes model while no obvious improvement is found when complex 
methods are used (Hsu, Huang, and Wong, 2000). 


Sometimes the equal width binning method may produce empty bins. In this case, empty bins are eliminated by changing bin boundary points. Let $b_1 < b_2 < \ldots < b_n$ be the bin boundary points produced by the equal width binning method. The two end bins $(-\infty, b_1]$ and $(b_n, \infty)$ are non-empty by design. Suppose that bin $(b_i, b_{i+1}]$ is empty, and suppose that the closest left non-empty bin has right boundary point $b_j$ ($\le b_i$) and the closest right non-empty bin has left boundary point $b_k$ ($> b_j$). Then empty bins are eliminated by deleting all boundary points from $b_j$ to $b_k$, and setting a new boundary point at $(b_j + b_k)/2$.
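The boundary adjustment can be sketched as follows. This is an illustrative reading of the rule, assuming half-open bins of the form (left, right] and non-empty end bins as stated; `count_bins` and `merge_empty_bins` are hypothetical names.

```python
import bisect

def count_bins(values, bounds):
    """Count values per bin; bin t holds bounds[t-1] < v <= bounds[t]."""
    counts = [0] * (len(bounds) + 1)
    for v in values:
        counts[bisect.bisect_left(bounds, v)] += 1
    return counts

def merge_empty_bins(values, bounds):
    """Replace the boundaries around each run of empty bins by their midpoint."""
    counts = count_bins(values, bounds)
    n = len(bounds)
    new_bounds = []
    t = 0                                  # bounds[t] is the right edge of bin t
    while t < n:
        if t + 1 <= n - 1 and counts[t + 1] == 0:
            e = t + 1                      # extend over the run of empty interior bins
            while e + 1 <= n - 1 and counts[e + 1] == 0:
                e += 1
            # Boundaries bounds[t]..bounds[e] are dropped and replaced by one midpoint.
            new_bounds.append((bounds[t] + bounds[e]) / 2)
            t = e + 1
        else:
            new_bounds.append(bounds[t])
            t += 1
    return new_bounds

if __name__ == "__main__":
    print(merge_empty_bins([5, 12, 45, 38], [10, 20, 30, 40]))
    print(merge_empty_bins([5, 12, 45], [10, 20, 30, 40]))
```

In the first call the bin (20, 30] is empty, so boundaries 20 and 30 are replaced by 25; in the second, two adjacent empty bins are removed together and a single boundary at 30 remains.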


Feature Selection 


Given a total of Jo predictors, the goal of feature selection is to choose a subset of J predictors 
using the Naive Bayes model (Natarajan and Pednault, 2001). This process has the following steps: 


■ Collect the necessary summary statistics to estimate all possible model parameters.

■ Create a sequence of candidate predictor subsets that has an increasing number of predictors; that is, each successive subset is equal to the previous subset plus one more predictor.

■ From this sequence, find the “best” subset.


Collect Summary Statistics 


One pass through the training data is required to collect the total number of cases, the number 
of cases per category of the target variable, and the number of cases per category of the target 
variable for each category of each predictor. 


Create the Sequence of Subsets 


Start with an initial subset of predictors considered vital to the model, which can be empty. For 
each predictor not in the subset, a Naive Bayes model is fit with the predictor plus the predictors 
in the subset. The predictor that gives the largest average log-likelihood is added to create the next 
larger subset. This continues until the model includes the user-specified: 


■ Exact number of predictors

or

■ Maximum number of predictors


Alternatively, the maximum number of predictors, $J_{Max}$, may be automatically chosen by the
following equation:


IMax = min {Just + min {100, max (20, #2) }, Jo} 


where $J_{Must}$ is the number of predictors in the initial subset.


Find the “Best” Subset 


If you specify an exact number of predictors, the final subset in the sequence is the final model. 
If you specify a maximum number of predictors, the “best” subset is determined by one of the 
following: 


■ A test data criterion based on the average log-likelihood of the test data.

■ A pseudo-BIC criterion that uses the average log-likelihood of the training data and penalizes overly complex models using the number of predictors and number of cases. This criterion is used if there are no test data.

Smaller values of these criteria indicate “better” models. The “best” subset is the one with the smallest value of the criterion used.


Test Data Criterion

\[ Q(J) = -l_{Test}(J) \]

where $l_{Test}(J)$ is the average log-likelihood for the test data.


Pseudo-BIC Criterion 


Q(J) = —lqyain (J) + 34 


where $J$ denotes the number of predictors in the model, and $l_{Train}(J)$ is the average
log-likelihood for the training data.


Average Log-likelihood 


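The average (conditional) log-likelihood for data $\{x_i, y_i\}_{i=1}^{N}$ with $J$ predictors is

\[ l(J) = \frac{1}{N} \sum_{i=1}^{N} \log P(Y = y_i \mid X = x_i) \]

\[ = \frac{1}{N} \sum_{i=1}^{N} \left[ \log P(Y = y_i) + \log P(X = x_i \mid Y = y_i) \right] - \frac{1}{N} \sum_{i=1}^{N} \log \sum_{k=1}^{K} P(X = x_i \mid Y = k) \, P(Y = k) \]

\[ = \frac{1}{N} \sum_{k=1}^{K} N_k \log(\hat{\pi}_k) + \frac{1}{N} \sum_{k=1}^{K} \sum_{j=1}^{J} \sum_{m=1}^{M_j} N_{jmk} \log(\hat{p}_{jmk}) - \frac{1}{N} \sum_{i=1}^{N} \log \left( \sum_{k=1}^{K} \hat{\pi}_k \prod_{j=1}^{J} \hat{p}_{j x_{ij} k} \right) \]

Let

\[ A(J) = \frac{1}{N} \sum_{k=1}^{K} N_k \log(\hat{\pi}_k) + \frac{1}{N} \sum_{k=1}^{K} \sum_{j=1}^{J} \sum_{m=1}^{M_j} N_{jmk} \log(\hat{p}_{jmk}) \]

\[ B(J) = -\frac{1}{N} \sum_{i=1}^{N} \log \left( \sum_{k=1}^{K} \hat{\pi}_k \prod_{j=1}^{J} \hat{p}_{j x_{ij} k} \right) \]

then

\[ l(J) = A(J) + B(J) \]

Note: for the special situation in which $J = 0$; that is, there are no predictors,

\[ l(0) = \frac{1}{N} \sum_{i=1}^{N} \log P(Y = y_i) = \frac{1}{N} \sum_{k=1}^{K} N_k \log(\hat{\pi}_k) \]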


Calculation of average log-likelihood by sampling 
When adding each predictor to the sequence of subsets, a data pass is needed to calculate $B(J)$. When the data set is small enough to fit in memory, this is not a problem. When the data set cannot fit in memory, this can be costly. The Naive Bayes model uses simulated data to calculate $B(J)$. Other research has shown that this approach yields good results (Natarajan et al., 2001). The formula for $B(J)$ can be rewritten as, for a data set of $m$ cases:

\[ B(J) = -\frac{1}{m} \sum_{i=1}^{m} \log \left( \sum_{k=1}^{K} \hat{\pi}_k \prod_{j=1}^{J} \hat{p}_{j x_{ij} k} \right) \]

By default $m = 1000$.


Classification 


The target category with the highest posterior probability is the predicted category for a given case. 


\[ \hat{y}(x) = \arg\max_k \{ P(Y = k \mid X = x) \} = \arg\max_k \{ P(X = x \mid Y = k) \, P(Y = k) \} \]

Ties are broken in favor of the target category with greater prior probability $\pi_k$.


When cases being classified contain categories of categorical predictors that did not occur in the 
training data, these new categories are treated as missing. 
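The classification rule with its tie-break fits in one expression. The sketch below assumes the posterior and prior probabilities are already available as dictionaries keyed by target category; the function name is hypothetical.

```python
def classify(posteriors, priors):
    """Pick the category with the highest posterior; break ties by higher prior."""
    return max(posteriors, key=lambda k: (posteriors[k], priors[k]))

if __name__ == "__main__":
    print(classify({"a": 0.5, "b": 0.5}, {"a": 0.3, "b": 0.7}))
```

Comparing tuples orders first by posterior and only then by prior, so the prior matters only when the posteriors are exactly equal.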


Classification Error 


If there is test data, the error equals the misclassification ratio of the test data. If there is no test 
data, the error equals the misclassification ratio of the training data. 
Model


Goal 


NLR produces the least square estimates of the parameters for models that are not linear in their 
parameters. Unlike in other procedures, the weight function is not treated as a case replicate in 
NLR. 


Consider the model 
\[ f = f(x, \Theta) \]

where $\Theta$ is a $p \times 1$ parameter vector, $x$ is an independent variable vector, and $f$ is a function of $x$ and $\Theta$.


Find the least squares estimate $\Theta^*$ of $\Theta$ such that $\Theta^*$ minimizes the objective function

\[ F(\Theta) = R' W R \]

where

\[ R = (R_1, \ldots, R_n)' \]

\[ R_i = y_i - f_i \]

\[ f_i = f(x_i, \Theta), \quad i = 1, \ldots, n \]

\[ W = \mathrm{Diag}(W_1, \ldots, W_n) \]

and $n$ is the number of cases. For case $i$, $y_i$ is the observed dependent variable, $x_i$ is the vector of observed independent variables, and $W_i$ is the weight function, which can be a function of $\Theta$.


The gradient of $F$ at $\Theta$ is defined as

\[ \nabla F_j = 2 J_j' W R \]

where $J_j$ is the jth column of the $n \times p$ Jacobian matrix $J$ whose $(i, j)$th element is defined by

\[ J_{ij} = \frac{R_i}{2 W_i} \frac{\partial W_i}{\partial \Theta_j} - \frac{\partial f_i}{\partial \Theta_j} \]

Estimation 


The modified Levenberg-Marquardt algorithm (Moré, 1977) that is contained in MINPACK is 
used in NLR to solve the objective function. 


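Given an initial value $\Theta^{(0)}$ for $\Theta$, the algorithm is as follows:

At stage $k + 1$, $k = 0, 1, 2, \ldots$

► Compute

\[ f^{(k)} = f(\Theta^{(k)}), \quad R^{(k)} = y - f^{(k)}, \quad F^{(k)} = F(\Theta^{(k)}) \quad \text{and} \quad J^{(k)} = J(\Theta^{(k)}) \]

► Choose an appropriate non-negative scalar $\alpha_k$ such that

\[ F(\Theta^{(k)} + h_k) < F^{(k)} \]

where

\[ h_k = -\left( J^{(k)\prime} W^{(k)} J^{(k)} + \alpha_k I \right)^{-1} J^{(k)\prime} W^{(k)} R^{(k)} \]

► Set

\[ \Theta^{(k+1)} = \Theta^{(k)} + h_k \]

and compute $J^{(k+1)}$, $R^{(k+1)}$, $W^{(k+1)}$, $F^{(k+1)}$.

► Check the following conditions:

1. $1 - \left( F^{(k+1)} / F^{(k)} \right) < \epsilon$ (SSCON)

2. For every element of $h_k$,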


ni < €(PCON) 
3. k+1> ITER (maximum number of iterations) 


4. For every parameter $\Theta_j$, the gradient of $F$ at $\Theta_j$, $\nabla F_j$, is evaluated at $\Theta^{(k+1)}$ by checking

\[ \left| r_j^{(k+1)} \right| < \epsilon \ \text{(RCON)} \]

where $r_j^{(k+1)}$ is the correlation between the jth column $J_j^{(k+1)}$ of $J^{(k+1)}$ and $W^{(k+1)} R^{(k+1)}$.


If any of these four conditions is satisfied, the algorithm stops. The final parameter estimate is

\[ \Theta^* = \Theta^{(k+1)} \]

and the termination reason is reported. Otherwise, iteration continues.


Statistics 


When the estimation algorithm terminates, the following statistics are printed. 


Parameter Estimates and Standard Errors 


The asymptotic standard error of $\Theta^*_j$ is estimated by the square root of the jth diagonal element $a_{jj}$ of $A$, where

\[ A = \frac{F(\Theta^*)}{n - p} \left( J^{*\prime} W^* J^* \right)^{-1} \]

and $J^*$ and $W^*$ are the Jacobian matrix $J$ and weight function $W$ evaluated at $\Theta^*$, respectively.
Asymptotic 95% Confidence Interval for Parameter Values

\[ \Theta^*_i \pm t(0.975, n - p) \sqrt{a_{ii}} \]


Asymptotic Correlation Matrix of the Parameter Estimates

\[ C = D^{-1/2} A D^{-1/2} \]

where

\[ D = \mathrm{Diag}(a_{11}, \ldots, a_{pp}) \]

and $a_{ii}$ is the ith diagonal element of $A$.
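Scaling a covariance matrix by the inverse square roots of its diagonal is a one-liner per element. The sketch below assumes the asymptotic covariance matrix is given as a list of lists with positive diagonal entries; the function names and the example matrix are hypothetical.

```python
import math

def standard_errors(A):
    """Square roots of the diagonal elements of a covariance matrix."""
    return [math.sqrt(A[i][i]) for i in range(len(A))]

def asymptotic_correlation(A):
    """Element-wise form of D^(-1/2) A D^(-1/2), where D holds the diagonal of A."""
    se = standard_errors(A)
    p = len(A)
    return [[A[i][j] / (se[i] * se[j]) for j in range(p)] for i in range(p)]

if __name__ == "__main__":
    A = [[4.0, 2.0], [2.0, 9.0]]
    print(standard_errors(A))
    print(asymptotic_correlation(A))
```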


Analysis of Variance Table 


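Source               df       Sum of Squares
Residual             n - p    $F(\Theta^*)$
Regression           p        $SS_{uncorrected} - F(\Theta^*)$
Uncorrected Total    n        $SS_{uncorrected}$
Corrected Total      n - 1    $SS_{uncorrected} - \bar{y}^2 \sum_{i=1}^{n} W_i(\Theta^*)$


where

\[ SS_{uncorrected} = \sum_{i=1}^{n} W_i(\Theta^*) \, y_i^2 \]

\[ \bar{y} = \left( \sum_{i=1}^{n} y_i W_i(\Theta^*) \right) / \left( \sum_{i=1}^{n} W_i(\Theta^*) \right) \]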
The purpose of the Multinomial Logistic Regression procedure is to model the dependence of a
nominal categorical response on a set of discrete and/or continuous predictor variables.


Notation 


The following notation is used throughout this chapter unless otherwise stated:

$Y$                           The response variable, which takes integer values from 1 to $J$.
$J$                           The number of categories of the nominal response.
$m$                           The number of subpopulations.
$X^x$                         $m \times p^x$ matrix with vector-element $x^x_i$, the observed values at the ith subpopulation, determined by the independent variables specified in the command.
$X$                           $m \times p$ matrix with vector-element $x_i$, the observed values of the location model's independent variables at the ith subpopulation.
$f_{ijs}$                     The frequency weight for the sth observation which belongs to the cell corresponding to $Y = j$ at subpopulation $i$.
$n_{ij}$                      The sum of frequency weights of the observations that belong to the cell corresponding to $Y = j$ at subpopulation $i$.
$N$                           The sum of all $n_{ij}$'s.
$\pi_{ij}$                    The cell probability corresponding to $Y = j$ at subpopulation $i$.
$\log(\pi_{ij} / \pi_{ik})$   The logit of response category $j$ to response category $k$.
$\beta_j$                     $p \times 1$ vector of unknown parameters in the j-th logit (i.e., logit of response category $j$ to response category $J$).
$p$                           Number of parameters in each logit. $p \ge 1$.
$p^{nr}_j$                    Number of non-redundant parameters in logit $j$ after maximum likelihood estimation. $p \ge p^{nr}_j \ge 0$.
$p^{nr}$                      The total number of non-redundant parameters after maximum likelihood estimation. $p^{nr} = \sum_{j=1}^{J-1} p^{nr}_j$.
$B$                           $(J - 1)p \times 1$ vector of unknown parameters in the model.
$\hat{B}$                     The maximum likelihood estimate of $B$.
$\hat{\pi}_{ij}$              The maximum likelihood estimate of $\pi_{ij}$.


Data Aggregation


Observations with negative or missing frequency weights are discarded. Observations are aggregated by the definition of subpopulations. Subpopulations are defined by the cross-classifications of either the set of independent variables specified in the command or the set of independent variables specified in the SUBPOP subcommand.


Let $n_i$ be the marginal count of subpopulation $i$,

\[ n_i = \sum_{j=1}^{J} n_{ij} \]

If there is no observation for the cell of $Y = j$ at subpopulation $i$, it is assumed that $n_{ij} = 0$, provided that $n_i \ne 0$. A non-negative scalar $\delta \in [0, 1)$ may be added to any zero cell (that is, a cell with $n_{ij} = 0$) if its marginal count $n_i$ is nonzero. The value of $\delta$ is zero by default.


Data Assumptions 


Let $(n_{i1}, \ldots, n_{iJ})'$ be the $J \times 1$ vector of counts for the categories of $Y$ at subpopulation $i$. It is assumed that each $(n_{i1}, \ldots, n_{iJ})'$ is independently multinomial distributed with probability vector $(\pi_{i1}, \ldots, \pi_{iJ})'$ of dimension $J \times 1$ and fixed total $n_i$.


Model 


NOMREG fits a generalized logit model that can also be used to model the results of 1-1 
matched case-control studies. 


Generalized Logit Model 


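In a Generalized Logit model, the probability $\pi_{ij}$ of response category $j$ at subpopulation $i$ is

\[ \pi_{ij} = \frac{\exp(x_i' \beta_j)}{1 + \sum_{k=1}^{J-1} \exp(x_i' \beta_k)} \]

where the last category $J$ is assumed to be the reference category (so that $\beta_J = 0$).


In terms of logits, the model can be expressed as

\[ \log\left( \frac{\pi_{ij}}{\pi_{iJ}} \right) = x_i' \beta_j \]

for $j = 1, \ldots, J - 1$.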


When J = 2, this model is equivalent to the binary Logistic Regression model. Thus, the above 
model can be thought of as an extension of the binary Logistic Regression model from binary 
response to polytomous nominal response. 
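A short sketch shows how the category probabilities follow from the logits when the last category is the reference. The function name and example coefficients are hypothetical; subtracting the largest linear predictor before exponentiating is a numerical-stability choice made for this example and does not change the result.

```python
import math

def generalized_logit_probs(x, betas):
    """Probabilities for J categories; betas holds J-1 coefficient vectors.

    The reference (last) category has an implicit coefficient vector of zeros.
    """
    etas = [sum(xi * bi for xi, bi in zip(x, beta)) for beta in betas]
    etas.append(0.0)                       # linear predictor of the reference category
    shift = max(etas)                      # guards against overflow in exp
    exps = [math.exp(eta - shift) for eta in etas]
    total = sum(exps)
    return [e / total for e in exps]

if __name__ == "__main__":
    probs = generalized_logit_probs([1.0, 2.0], [[0.2, 0.1], [-0.3, 0.4]])
    print(probs, sum(probs))
```

With a single coefficient vector (J = 2) the first probability is the familiar binary logistic function of the linear predictor.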


1-1 Matched Case Control Model by Conditional Likelihood Approach


The above model can also be used to estimate the parameters in the conditional likelihood of the 1-1 Matched Case Control Model. In this case, let $m$ be the number of matching pairs, $x_{i1}$ be the vector of independent variables for the case and $x_{i2}$ that for the control. The conditional log-likelihood for the $m$ matched pairs is given by

\[ l = \sum_{i=1}^{m} \log\left\{ \frac{\exp\{ (x_{i1} - x_{i2})' \beta \}}{1 + \exp\{ (x_{i1} - x_{i2})' \beta \}} \right\} \]

in which $\beta$ is the vector of parameters for the difference between the values of independent variables of the case and those of the control. This conditional log-likelihood is identical to the unconditional log-likelihood of a binary (that is, $J = 2$) logistic regression model when

■ There is no intercept term in the model.

■ The set of subpopulations is defined by the set of matching pairs.

■ The independent variables in the model are set equal to the differences between the values for the case and the control.

■ The number of response categories is $J = 2$, and the value of the response is 1 (or a constant), i.e., $Y = 1$.


Log-likelihood 


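The log-likelihood of the model is given by

\[ l = \sum_{i=1}^{m} \sum_{j=1}^{J} n_{ij} \log(\pi_{ij}) = \sum_{i=1}^{m} \sum_{j=1}^{J} n_{ij} \log\left( \frac{\exp(x_i' \beta_j)}{1 + \sum_{k=1}^{J-1} \exp(x_i' \beta_k)} \right) \]

A constant that is independent of parameters has been excluded here. The value of the constant is

\[ c = \sum_{i=1}^{m} \log\left\{ n_i! / (n_{i1}! \cdots n_{iJ}!) \right\} \]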


Parameter Estimation 


Estimation of the model parameters proceeds as follows. 


First and Second Derivatives of the Log-likelihood 


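For any $j = 1, \ldots, J-1$, $s = 1, \ldots, p$, the first derivative of $l$ with respect to $\beta_{js}$ is

\[ \frac{\partial l}{\partial \beta_{js}} = \sum_{i=1}^{m} x_{is} \left( n_{ij} - n_i \pi_{ij} \right) \]

For any $j, j' = 1, \ldots, J-1$, $s, t = 1, \ldots, p$, the second derivative of $l$ with respect to $\beta_{js}$ and $\beta_{j't}$ is

\[ \frac{\partial^2 l}{\partial \beta_{js} \, \partial \beta_{j't}} = -\sum_{i=1}^{m} n_i \, x_{is} \, x_{it} \, \pi_{ij} \left( \delta_{jj'} - \pi_{ij'} \right) \]

where $\delta_{jj'} = 1$ if $j = j'$, 0 otherwise.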
Maximum Likelihood Estimate


To obtain the maximum likelihood estimate of $B$, a Newton-Raphson iterative estimation method is used. Notice that this method is the same as the Fisher-Scoring iterative estimation method in this model, since the expectation of the second derivative of $l$ with respect to $B$ is the same as the observed one.


Let $\partial l / \partial B$ be the $(J-1)p \times 1$ vector of the first derivative of $l$ with respect to $B$. Moreover, let $[\partial^2 l / \partial B \partial B']$ be the $(J-1)p \times (J-1)p$ matrix of the second derivative of $l$ with respect to $B$. Notice that $-[\partial^2 l / \partial B \partial B'] = \sum_{i=1}^{m} X_i^* \Delta_i X_i^{*\prime}$, where $\Delta_i$ is a $(J-1) \times (J-1)$ matrix

\[ \Delta_i = n_i \left( \mathrm{Diag}\left( \pi_i^{(-J)} \right) - \pi_i^{(-J)} \pi_i^{(-J)\prime} \right) \]

in which $\pi_i^{(-J)} = (\pi_{i1}, \ldots, \pi_{i,J-1})'$ and $\mathrm{Diag}(\pi_i^{(-J)})$ is a $(J-1) \times (J-1)$ diagonal matrix of $\pi_i^{(-J)}$. Let $B^{(v)}$ be the parameter estimate at iteration $v$; the parameter estimate $B^{(v+1)}$ at iteration $v + 1$ is updated as

\[ B^{(v+1)} = B^{(v)} + \xi \left( \sum_{i=1}^{m} X_i^* \Delta_i^{(v)} X_i^{*\prime} \right)^{-1} \frac{\partial l}{\partial B^{(v)}} \]

and $\xi > 0$ is a stepping scalar such that $l(B^{(v+1)}) - l(B^{(v)}) > 0$, $X_i^*$ is a $(J-1)p \times (J-1)$ matrix of independent vectors,

\[ X_i^* = \begin{pmatrix} x_i & 0 & \cdots & 0 \\ 0 & x_i & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & x_i \end{pmatrix} \]

and $\Delta_i^{(v)}$ is $\Delta_i$ and $\partial l / \partial B^{(v)}$ is $\partial l / \partial B$, both evaluated at $B = B^{(v)}$.


Stepping


Use the step-halving method if $l(B^{(v+1)}) - l(B^{(v)}) < 0$. Let $V$ be the maximum number of steps in step-halving; the set of values of $\xi$ is $\{1/2^v : v = 0, \ldots, V - 1\}$.
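Step-halving can be sketched generically, independent of the particular model. The example below assumes a caller-supplied log-likelihood function and update direction; the function name `step_halving` and the toy objective are hypothetical, and returning `(None, None)` when no step helps is a choice made for this sketch.

```python
def step_halving(loglik, b, direction, max_steps):
    """Try step sizes 1, 1/2, ..., 1/2**(max_steps - 1) along the given direction.

    Returns the first candidate whose log-likelihood is not lower than at b,
    together with the step size used, or (None, None) if none qualifies.
    """
    base = loglik(b)
    for v in range(max_steps):
        xi = 1 / 2 ** v
        candidate = [bi + xi * di for bi, di in zip(b, direction)]
        if loglik(candidate) >= base:
            return candidate, xi
    return None, None

if __name__ == "__main__":
    # Toy concave objective with its maximum at 1.0; the full step overshoots.
    toy_loglik = lambda beta: -(beta[0] - 1.0) ** 2
    print(step_halving(toy_loglik, [0.0], [3.0], max_steps=5))
```

In the toy run the full step lands at 3.0, which lowers the objective, so the step is halved once and 1.5 is accepted.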


Starting Values of the Parameters 


If intercepts are included in the model, set

\[ \beta_j^{(0)} = \left( \delta_j^{(0)}, 0, \ldots, 0 \right)' \]

where

\[ \delta_j^{(0)} = \log\left( \frac{\sum_{i=1}^{m} n_{ij}}{\sum_{i=1}^{m} n_{iJ}} \right) \]

If intercepts are not included in the model, set

\[ \beta_j^{(0)} = (0, \ldots, 0)' \]

for $j = 1, \ldots, J - 1$.


Convergence Criteria 


Given two convergence criteria $\epsilon_l > 0$ and $\epsilon_p > 0$, the iteration is considered to have converged if one of the following criteria is satisfied:

1. $\left| l(B^{(v+1)}) - l(B^{(v)}) \right| < \epsilon_l$
2. $\max_j \left| B_j^{(v+1)} - B_j^{(v)} \right| < \epsilon_p$
3. The maximum absolute element in $\partial l / \partial B^{(v+1)}$ is less than $\min(\epsilon_l, \epsilon_p)$.


Stepwise Variable Selection 


Several methods are available for selecting independent variables. With the forced entry method, any variable in the variable list is entered into the model. The forward stepwise, backward stepwise, and backward elimination methods use either the Wald statistic or the likelihood ratio statistic for variable removal. The forward stepwise, forward entry, and backward stepwise methods use the score statistic or the likelihood ratio statistic to select variables for entry into the model.


Forward Stepwise (FSTEP) 


1. Estimate the parameter and likelihood function for the initial model and let it be our current model.
2. Based on the MLEs of the current model, calculate the score statistic or likelihood ratio statistic for every variable eligible for inclusion and find its significance.
3. Choose the variable with the smallest significance (p-value). If that significance is less than the probability for a variable to enter, then go to step 4; otherwise, stop FSTEP.
4. Update the current model by adding a new variable. If this results in a model which has already been evaluated, stop FSTEP.
5. Calculate the significance for each variable in the current model using LR or Wald's test.
6. Choose the variable with the largest significance. If its significance is less than the probability for variable removal, then go back to step 2. If the current model with the variable deleted is the same as a previous model, stop FSTEP; otherwise go to the next step.
7. Modify the current model by removing the variable with the largest significance from the previous model. Estimate the parameters for the modified model and go back to step 5.
Forward Only (FORWARD)
1. Estimate the parameter and likelihood function for the initial model and let it be our current model.
2. Based on the MLEs of the current model, calculate the score or LR statistic for every variable eligible for inclusion and find its significance.
3. Choose the variable with the smallest significance. If that significance is less than the probability for a variable to enter, then go to step 4; otherwise, stop FORWARD.
4. Update the current model by adding a new variable. If there are no more eligible variables left, stop FORWARD; otherwise, go to step 2.


Backward Stepwise (BSTEP) 


1. Estimate the parameters for the full model that includes the final model from the previous method
and all eligible variables. Only variables listed on the BSTEP variable list are eligible for entry and
removal. Let the current model be the full model.
2. Based on the MLEs of the current model, calculate the LR or Wald’s statistic for every variable
in the BSTEP list and find its significance.
3. Choose the variable with the largest significance. If that significance is less than the probability
for variable removal, then go to step 5. If the current model without the variable with the largest
significance is the same as the previous model, stop BSTEP; otherwise go to the next step.
4. Modify the current model by removing the variable with the largest significance from the model.
Estimate the parameters for the modified model and go back to step 2.
5. Check whether any eligible variable is not in the model. If there is none, stop BSTEP; otherwise,
go to the next step.
6. Based on the MLEs of the current model, calculate the LR statistic or score statistic for every
variable not in the model and find its significance.
7. Choose the variable with the smallest significance. If that significance is less than the probability
for variable entry, then go to the next step; otherwise, stop BSTEP.
8. Add the variable with the smallest significance to the current model. If the model is not the
same as any previous model, estimate the parameters for the new model and go back to step
2; otherwise, stop BSTEP.


Backward Only (BACKWARD) 


1. Estimate the parameters for the full model that includes all eligible variables. Let the current
model be the full model.
2. Based on the MLEs of the current model, calculate the LR or Wald’s statistic for all variables
eligible for removal and find their significance.
3. Choose the variable with the largest significance. If that significance is less than the probability
for variable removal, then stop BACKWARD; otherwise, go to the next step.
4. Modify the current model by removing the variable with the largest significance from the model.
Estimate the parameters for the modified model. If all the variables in the BACKWARD list are
removed, then stop BACKWARD; otherwise, go back to step 2.


Stepwise Statistics 


The statistics used in the stepwise variable selection methods are defined as follows. 


Score Function and Information Matrix 


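The score function for a model with parameter B is:

\[ U(B) = \frac{\partial l(B)}{\partial B} \]

The (j,s)th element of the score function can be written as

\[ [U(B)]_{js} = \frac{\partial l(B)}{\partial B_{js}} = \sum_{i=1}^{m} x_{is}\,(n_{ij} - N_i \pi_{ij}) \]

Similarly, elements of the information matrix are given by

\[ [I(B)]_{js,j't} = -\frac{\partial^2 l(B)}{\partial B_{js}\,\partial B_{j't}} = \sum_{i=1}^{m} N_i\, x_{is} x_{it}\, \pi_{ij}\,(\delta_{jj'} - \pi_{ij'}) \]

where $\delta_{jj'} = 1$ if $j = j'$, 0 otherwise.

(Note that the $\pi_{ij}$ in the formulas are functions of B.)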


Block Notations 


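By partitioning the parameter B into two parts, $B_1$ and $B_2$, the score function, information matrix,
and inverse information matrix can be written as partitioned matrices:

\[ U(B_1, B_2) = \begin{pmatrix} U_1(B_1, B_2) \\ U_2(B_1, B_2) \end{pmatrix} = \begin{pmatrix} \partial l(B_1, B_2)/\partial B_1 \\ \partial l(B_1, B_2)/\partial B_2 \end{pmatrix} \]

where $l(B_1, B_2) = l(B)$.

\[ I(B) = I(B_1, B_2) = \begin{pmatrix} I_{11}(B_1, B_2) & I_{12}(B_1, B_2) \\ I_{21}(B_1, B_2) & I_{22}(B_1, B_2) \end{pmatrix} = -\begin{pmatrix} \dfrac{\partial^2 l(B_1, B_2)}{\partial B_1 \partial B_1} & \dfrac{\partial^2 l(B_1, B_2)}{\partial B_1 \partial B_2} \\ \dfrac{\partial^2 l(B_1, B_2)}{\partial B_2 \partial B_1} & \dfrac{\partial^2 l(B_1, B_2)}{\partial B_2 \partial B_2} \end{pmatrix} \]

\[ J(B) = I(B)^{-1} = J(B_1, B_2) = \begin{pmatrix} J_{11}(B_1, B_2) & J_{12}(B_1, B_2) \\ J_{21}(B_1, B_2) & J_{22}(B_1, B_2) \end{pmatrix} \]

where

\[ J_{11} = I_{11}^{-1} + I_{11}^{-1} I_{12} J_{22} I_{21} I_{11}^{-1} \]
\[ J_{12} = -I_{11}^{-1} I_{12} J_{22} \]
\[ J_{21} = J_{12}^{T} \]
\[ J_{22} = \left[ I_{22} - I_{21} I_{11}^{-1} I_{12} \right]^{-1} \]

Typically, $B_1$ and $B_2$ are parameters corresponding to two different sets of effects. The dimensions
of the 1st and 2nd partition in U, I and J are equal to the numbers of parameters in $B_1$ and
$B_2$ respectively.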


Score Test 


Suppose a base model with parameter vector $B_{base}$ has the corresponding maximum likelihood
estimate $\hat{B}_{base}$. We are interested in testing the significance of an extra effect E if it is added to the
base model. For convenience, we will call the model with effect E the augmented model. Let
$B_E$ be the vector of extra parameters associated with the effect E; then the hypothesis can be
written as

\[ H_0: B_E = 0 \quad \text{vs.} \quad H_1: B_E \neq 0 \]

Using the block notations, the score function, information matrix and inverse information of the
augmented model can be written as

\[ U(B_{base}, B_E) = \begin{pmatrix} U_{base}(B_{base}, B_E) \\ U_E(B_{base}, B_E) \end{pmatrix} \]

\[ I(B_{base}, B_E) = \begin{pmatrix} I_{base,base}(B_{base}, B_E) & I_{base,E}(B_{base}, B_E) \\ I_{E,base}(B_{base}, B_E) & I_{E,E}(B_{base}, B_E) \end{pmatrix} \]

\[ J(B_{base}, B_E) = \begin{pmatrix} J_{base,base}(B_{base}, B_E) & J_{base,E}(B_{base}, B_E) \\ J_{E,base}(B_{base}, B_E) & J_{E,E}(B_{base}, B_E) \end{pmatrix} \]

Then the score statistic for testing our hypothesis will be

\[ s = U_E(\hat{B}_{base}, 0)^{T}\, J_{E,E}(\hat{B}_{base}, 0)\, U_E(\hat{B}_{base}, 0) \]

where $U_E(\hat{B}_{base}, 0)$ and $J_{E,E}(\hat{B}_{base}, 0)$ are the 2nd partition of the score function and inverse
information matrix evaluated at $B_{base} = \hat{B}_{base}$ and $B_E = 0$.

Under the null hypothesis, the score statistic s has a chi-square distribution with degrees of
freedom equal to the rank of $J_{E,E}(\hat{B}_{base}, 0)$. If the rank of $J_{E,E}(\hat{B}_{base}, 0)$ is zero, then the score
statistic will be set to 0 and the p-value will be 1. Otherwise, if the rank of $J_{E,E}(\hat{B}_{base}, 0)$ is
$r_E$ with $r_E > 0$, then the p-value of the test is equal to $1 - F(s; r_E)$, where $F(s; r_E)$ is the cumulative
distribution function of a chi-square distribution with $r_E$ degrees of freedom.
Computational Formula for Score Statistic


When we compute the score statistic s, it is not necessary to re-compute $U(\hat{B}_{base}, 0)$ and
$I(\hat{B}_{base}, 0)$ from scratch. The score function and information matrix of the base model can be
reused in the calculation. Using the block notations introduced earlier, we have

\[ U(\hat{B}_{base}, 0) = \begin{pmatrix} U_{base}(\hat{B}_{base}, 0) \\ U_E(\hat{B}_{base}, 0) \end{pmatrix} = \begin{pmatrix} U(\hat{B}_{base}) \\ U_E(\hat{B}_{base}, 0) \end{pmatrix} \]

and

\[ I(\hat{B}_{base}, 0) = \begin{pmatrix} I(\hat{B}_{base}) & I_{base,E}(\hat{B}_{base}, 0) \\ I_{E,base}(\hat{B}_{base}, 0) & I_{E,E}(\hat{B}_{base}, 0) \end{pmatrix} \]

In stepwise logistic regression, it is necessary to compute one score test for each effect that is not
in the base model. Since the 1st partitions of $U(\hat{B}_{base}, 0)$ and $I(\hat{B}_{base}, 0)$ depend only on the
base model, we only need to compute $U_E(\hat{B}_{base}, 0)$, $I_{base,E}(\hat{B}_{base}, 0)$ and $I_{E,E}(\hat{B}_{base}, 0)$ for
each new effect.

If $\beta_{js}$ is the s-th parameter of the j-th logit in $B_{base}$ and $\beta_{kt}$ is the t-th parameter of the k-th logit in
$B_E$, then the elements of $U_E(\hat{B}_{base}, 0)$, $I_{base,E}(\hat{B}_{base}, 0)$ and $I_{E,E}(\hat{B}_{base}, 0)$ can be expressed
as follows:

\[ [U_E(\hat{B}_{base}, 0)]_{kt} = \sum_{i=1}^{m} x_{it}\,(n_{ik} - N_i \hat{\pi}_{ik}) \]

\[ [I_{E,E}(\hat{B}_{base}, 0)]_{kt,k't'} = \sum_{i=1}^{m} N_i\, x_{it} x_{it'}\, \hat{\pi}_{ik}\,(\delta_{kk'} - \hat{\pi}_{ik'}) \]

\[ [I_{base,E}(\hat{B}_{base}, 0)]_{js,kt} = \sum_{i=1}^{m} N_i\, x_{is} x_{it}\, \hat{\pi}_{ij}\,(\delta_{jk} - \hat{\pi}_{ik}) \]

where $\hat{\pi}_{ij}$ and $\hat{\pi}_{ik}$ are computed under the base model.


Wald’s Test 


In backward stepwise selection, we are interested in removing an effect F from an already fitted 
model. For a given base model with parameter vector B;,.,., we want to use Wald’s statistic to 
test if effect F should be removed from the base model. If the parameter vector for the effect F is 
Bp, then the hypothesis can be formulated as 


\[ H_0: B_F = 0 \quad \text{vs.} \quad H_1: B_F \neq 0 \]

In order to write down the expression of the Wald’s statistic, we will partition our parameter vector
(and its estimate) into two parts as follows:

\[ B_{base} = \begin{pmatrix} B_{base \setminus F} \\ B_F \end{pmatrix} \quad \text{and} \quad \hat{B}_{base} = \begin{pmatrix} \hat{B}_{base \setminus F} \\ \hat{B}_F \end{pmatrix} \]

The first partition contains the parameters that we intend to keep in the model and the 2nd partition
contains the parameters of the effect F, which may be removed from the model. The information
matrix and inverse information will be partitioned accordingly,

\[ I(B_{base \setminus F}, B_F) = \begin{pmatrix} I_{base \setminus F,\, base \setminus F}(B_{base \setminus F}, B_F) & I_{base \setminus F,\, F}(B_{base \setminus F}, B_F) \\ I_{F,\, base \setminus F}(B_{base \setminus F}, B_F) & I_{F,F}(B_{base \setminus F}, B_F) \end{pmatrix} \]

and

\[ J(B_{base \setminus F}, B_F) = \begin{pmatrix} J_{base \setminus F,\, base \setminus F}(B_{base \setminus F}, B_F) & J_{base \setminus F,\, F}(B_{base \setminus F}, B_F) \\ J_{F,\, base \setminus F}(B_{base \setminus F}, B_F) & J_{F,F}(B_{base \setminus F}, B_F) \end{pmatrix} \]

Using the above notations, the Wald’s statistic for effect F can be expressed as

\[ w = \hat{B}_F^{T} \left[ J_{F,F}(\hat{B}_{base \setminus F}, \hat{B}_F) \right]^{-1} \hat{B}_F \]

Under the null hypothesis, w has a chi-square distribution with degrees of freedom equal to the
rank of $J_{F,F}(\hat{B}_{base \setminus F}, \hat{B}_F)$. If the rank of $J_{F,F}(\hat{B}_{base \setminus F}, \hat{B}_F)$ is zero, then Wald’s statistic will be
set to 0 and the p-value will be 1. Otherwise, if the rank of $J_{F,F}(\hat{B}_{base \setminus F}, \hat{B}_F)$ is $r_F$ with $r_F > 0$, then
the p-value of the test is equal to $1 - F(w; r_F)$, where $F(w; r_F)$ is the cumulative distribution
function of a chi-square distribution with $r_F$ degrees of freedom.


Statistics 


The following output statistics are available. 


Model Information 


The model information (-2 log-likelihood) is available for the initial and final model. 


Initial Model, Intercept-Only 


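If intercepts are included in the model, the predicted probability for the initial model (that is,
the model with intercepts only) is the observed marginal proportion of category j,

\[ \pi^{(0)}_{ij} = \frac{\sum_{i=1}^{m} n_{ij}}{\sum_{i=1}^{m} N_i} \]

and the value of the -2 log-likelihood of the initial model is

\[ -2\, l(\pi^{(0)}) = -2 \sum_{i=1}^{m} \sum_{j=1}^{J} n_{ij} \log(\pi^{(0)}_{ij}) \]


Initial Model, Empty


If intercepts are not included in the model, the predicted probability for the initial model is

\[ \pi^{(0)}_{ij} = \frac{1}{J} \]

and the value of the -2 log-likelihood of the initial model is

\[ -2\, l(\pi^{(0)}) = -2 \sum_{i=1}^{m} \sum_{j=1}^{J} n_{ij} \log(\pi^{(0)}_{ij}) \]


Final Model


The value of the -2 log-likelihood of the final model is

\[ -2\, l(\hat{\pi}) = -2 \sum_{i=1}^{m} \sum_{j=1}^{J} n_{ij} \log(\hat{\pi}_{ij}) \]


Model Chi-Square


The Model Chi-square is given by

\[ -2\, l(\pi^{(0)}) - \{ -2\, l(\hat{\pi}) \} \]

where $\pi^{(0)}$ and $\hat{\pi}$ denote the predicted probabilities under the initial and the final model.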


Model with Intercepts versus Intercept-only Model 
If the final model includes intercepts, then the initial model is an intercept-only model. Under the
null hypothesis that all non-intercept parameters are zero, $H_0: B_{non\text{-}intercepts} = 0$, the Model
Chi-square is asymptotically chi-squared distributed with $p^{nr} - (J - 1)$ degrees of freedom.


Model without Intercepts versus Empty Model
If the model does not include intercepts, then the initial model is an empty model. Under the
null hypothesis that $H_0: B = 0$, the Model Chi-square is asymptotically chi-squared
distributed with $p^{nr}$ degrees of freedom.


Pseudo R-Square 


The R? statistic cannot be exactly computed for multinomial logistic regression models, so these 
approximations are computed instead. 


Cox and Snell’s R-Square 
Nagelkerke’s R-Square


\[ R_N^2 = \frac{R_{CS}^2}{1 - L(\pi^{(0)})^{2/n}} \]

where $R_{CS}^2$ is Cox and Snell’s R-Square and $L(\pi^{(0)})$ is the likelihood of the initial model.


McFadden’s R-Square 


Measures of Monotone Association 


When the response variable has exactly two levels; that is, k = 2, measures of monotone 
association are available. 


Without loss of generality, let the predicted probability for the category which is not the base
category be $\hat{\pi}_{ij}$. Also, let $s_i = [500 \times \hat{\pi}_{ij}]/500$, where [x] is the integer part of the value x.


Take a pair of observations indexed by $i_1$ and $i_2$ with different observed responses; the smaller
index corresponds to a lower predictor value. This pair is a concordant pair if $s_{i_1} < s_{i_2}$ for
$i_1 < i_2$. This pair is a discordant pair if $s_{i_1} > s_{i_2}$ for $i_1 < i_2$. If the pair is neither concordant nor
discordant, it is a tied pair. Suppose there are a total of t pairs with different responses, $m_c$ pairs
are concordant, $m_d$ pairs are discordant, and $t - m_c - m_d$ pairs are tied. The following measures
of monotone association are computed.


Somers’ D


\[ D = (m_c - m_d)/t \]


Goodman-Kruskal’s Gamma


\[ Gamma = (m_c - m_d)/(m_c + m_d) \]


Kendall’s Tau-a


\[ Tau\text{-}a = 2(m_c - m_d)/(n(n - 1)) \]

where n is the total sum of all frequencies, $n = \sum_{i=1}^{m} \sum_{j=1}^{J} n_{ij}$.


Concordance Index C


\[ C = (m_c + (t - m_c - m_d)/2)/t \]
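The four measures can be illustrated with a short Python sketch. It assumes unweighted observations with a binary response coded so that the larger code is the non-base category, and it reads "concordant" as: the observation with the lower response code has the smaller binned probability. That reading, the function name `monotone_association`, and the returned dictionary keys are assumptions made for this example only.

```python
import math


def monotone_association(responses, probs):
    """Somers' D, Gamma, Tau-a and C for a binary response.

    responses: observed response codes (two distinct values).
    probs: predicted probability of the non-base (larger-coded) category.
    """
    # Bin the predicted probabilities: s_i = [500 * p_i] / 500.
    s = [math.floor(500 * p) / 500 for p in probs]
    n = len(responses)
    t = m_c = m_d = 0
    for a in range(n):
        for b in range(a + 1, n):
            if responses[a] == responses[b]:
                continue                     # only pairs with different responses count
            t += 1
            lo, hi = (a, b) if responses[a] < responses[b] else (b, a)
            if s[lo] < s[hi]:
                m_c += 1                     # concordant pair
            elif s[lo] > s[hi]:
                m_d += 1                     # discordant pair
    if t == 0:
        raise ValueError("no pairs with different responses")
    gamma = (m_c - m_d) / (m_c + m_d) if (m_c + m_d) else float("nan")
    return {
        "somers_d": (m_c - m_d) / t,
        "gamma": gamma,
        "tau_a": 2 * (m_c - m_d) / (n * (n - 1)),
        "c": (m_c + (t - m_c - m_d) / 2) / t,
    }
```

For responses `[0, 0, 1, 1]` with predicted probabilities `[0.1, 0.6, 0.4, 0.9]` there are four pairs with different responses, three concordant and one discordant, which gives D = 0.5, Gamma = 0.5, Tau-a = 1/3 and C = 0.75.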
Goodness of Fit Measures


The following tests of the null hypothesis that the model adequately fits the data are available.


Pearson Goodness of Fit Measure


\[ X^2 = \sum_{i=1}^{m} \sum_{j=1}^{J} \frac{(n_{ij} - N_i \hat{\pi}_{ij})^2}{N_i \hat{\pi}_{ij}} \]

Under the null hypothesis, the Pearson goodness-of-fit statistic is asymptotically chi-squared
distributed with $m(J - 1) - p^{nr}$ degrees of freedom.


Deviance Goodness of Fit Measure


\[ D = 2 \sum_{i=1}^{m} \sum_{j=1}^{J} n_{ij} \log\left( \frac{n_{ij}}{N_i \hat{\pi}_{ij}} \right) \]

Under the null hypothesis, the Deviance goodness-of-fit statistic is asymptotically chi-squared
distributed with $m(J - 1) - p^{nr}$ degrees of freedom.
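As an illustration, both statistics can be computed from a table of observed counts and fitted probabilities. The sketch below assumes `counts[i][j]` holds $n_{ij}$, `probs[i][j]` holds the fitted probability for the same cell, and $N_i$ is the row total; cells with $n_{ij} = 0$ contribute nothing to the deviance (the usual 0 log 0 = 0 convention, which is an assumption here). The function name is hypothetical.

```python
from math import log


def goodness_of_fit(counts, probs):
    """Return (Pearson X^2, Deviance D) for an m x J table."""
    pearson = 0.0
    deviance = 0.0
    for n_i, pi_i in zip(counts, probs):
        total = sum(n_i)                     # N_i, the subpopulation size
        for n_ij, pi_ij in zip(n_i, pi_i):
            expected = total * pi_ij         # N_i * pi_ij
            pearson += (n_ij - expected) ** 2 / expected
            if n_ij > 0:                     # 0 * log(0) is taken as 0
                deviance += 2.0 * n_ij * log(n_ij / expected)
    return pearson, deviance
```

For example, `goodness_of_fit([[3, 7], [6, 4]], [[0.4, 0.6], [0.5, 0.5]])` compares observed counts with expected counts 4, 6, 5 and 5.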


Overdispersion Adjustments 


Let $\hat{\sigma}^2 > 0$ be an estimate of the overdispersion parameter. Possible estimates of this parameter are:

- A positive value specified in the command. If no value is specified, 1 is assumed.

- The ratio of the Pearson goodness-of-fit measure to its degrees of freedom:

\[ \hat{\sigma}^2 = \frac{X^2}{m(k - 1) - p^{nr}} \]

- The ratio of the Deviance goodness-of-fit measure to its degrees of freedom:

\[ \hat{\sigma}^2 = \frac{D}{m(k - 1) - p^{nr}} \]


Covariance and Correlation Matrices 


The estimate of the covariance matrix of the parameters is the inverse of the negative of the 
second derivative of the log-likelihood evaluated at B = B‘”’, multiplied by the estimate of the 
overdispersion parameter. 


-1 


Cov(B) =4 bs XtA,x?" 
i=1 


Let $\hat{\sigma}$ be the $(J-1)p \times 1$ vector of the square roots of the diagonal elements in $Cov(\hat{B})$. The estimate
of the correlation matrix of $\hat{B}$ is

\[ Cor(\hat{B}) = Diag(\hat{\sigma})^{-1}\, Cov(\hat{B})\, Diag(\hat{\sigma})^{-1} \]
Parameter Statistics


An estimate of the standard deviation of $\hat{B}_{js}$ is $\hat{\sigma}_{js}$. The Wald statistic for $\hat{B}_{js}$ is

\[ Wald_{js} = \left( \frac{\hat{B}_{js}}{\hat{\sigma}_{js}} \right)^2 \]

Under the null hypothesis that $H_0: B_{js} = 0$, $Wald_{js}$ is asymptotically chi-square distributed
with 1 degree of freedom.

Based on the asymptotic normality of the parameter estimate, a $100(1-\alpha)\%$ Wald confidence
interval for $B_{js}$ is

\[ \hat{B}_{js} \pm z_{1-\alpha/2}\, \hat{\sigma}_{js} \]

where $z_{1-\alpha/2}$ is the upper $(1-\alpha/2)100$th percentile of the standard normal distribution.
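A small Python sketch of these parameter statistics follows. It uses only the standard library: the 1-degree-of-freedom chi-square tail probability is obtained from the identity $\Pr(\chi^2_1 > w) = \mathrm{erfc}(\sqrt{w/2})$, and the normal percentile comes from `statistics.NormalDist`. The function name and the default `alpha=0.05` are assumptions for the example.

```python
from math import erfc, sqrt
from statistics import NormalDist


def wald_summary(estimate, std_error, alpha=0.05):
    """Return (Wald statistic, p-value, (lower, upper) confidence limits)."""
    wald = (estimate / std_error) ** 2
    # Chi-square with 1 df: Pr(chi2 > w) = Pr(|Z| > sqrt(w)) = erfc(sqrt(w / 2)).
    p_value = erfc(sqrt(wald / 2.0))
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return wald, p_value, (estimate - z * std_error, estimate + z * std_error)
```

For an estimate of 1.2 with standard error 0.4, the Wald statistic is 9 and the 95% interval is roughly (0.416, 1.984).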


Predicted Cell Counts
At each subpopulation i, the predicted count for response category Y=j is

\[ \hat{n}_{ij} = N_i \hat{\pi}_{ij} \]

The (raw) residual is $n_{ij} - \hat{n}_{ij}$ and the standardized residual is $(n_{ij} - \hat{n}_{ij}) / \sqrt{N_i \hat{\pi}_{ij} (1 - \hat{\pi}_{ij})}$.


Likelihood Based Partial Effects 


A likelihood ratio test is performed for any effect (except intercept) in the model. The procedure 
to perform a likelihood ratio test for any effect e is as follows: 


1. Form a submodel that has all the effects in the working model except the one (e) of interest.

2. Fit the submodel and calculate the value of its -2 log-likelihood; denote it by $-2\, l(\hat{\pi}_{(e)})$.
Moreover, let the number of non-redundant parameters in this submodel be $p^{nr}_{(e)}$.

3. Calculate the difference between the -2 log-likelihood of the submodel and that of the working
model, $\{-2\, l(\hat{\pi}_{(e)})\} - \{-2\, l(\hat{\pi})\}$.

Under the null hypothesis that the effect e of interest is zero, $\{-2\, l(\hat{\pi}_{(e)})\} - \{-2\, l(\hat{\pi})\}$ is
asymptotically chi-square distributed with $p^{nr} - p^{nr}_{(e)}$ degrees of freedom.

Linear Hypothesis Testing 


For each q x p matrix of linear combinations L, J Wald’s tests are performed. Each of the first 
J — 1 Wald’s tests corresponds to a Wald’s test on each of the J — 1 logits. The last Wald’s test 
corresponds to a joint Wald’s test for all the J — 1 logits. In the following, it is assumed that 

q = Rank(L) < p. 


The Wald’s test corresponding to the jth logit is

\[ Wald(L, j) = (L\hat{B}_j)^{T} \left\{ L\, Cov(\hat{B}_j)\, L^{T} \right\}^{-1} (L\hat{B}_j) \]

Under the null hypothesis, $H_0: LB_j = 0$, $Wald(L, j)$ is asymptotically chi-square distributed
with q degrees of freedom.

Let $L^*$ be the $(J - 1)q \times (J - 1)p$ block-diagonal matrix

\[ L^* = \begin{pmatrix} L & 0 & \cdots & 0 \\ 0 & L & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & L \end{pmatrix} \]

The Wald’s joint test for all logits is

\[ Wald(L) = (L^*\hat{B})^{T} \left\{ L^*\, Cov(\hat{B})\, L^{*T} \right\}^{-1} (L^*\hat{B}) \]

Under the null hypothesis, $H_0: L^*B = 0$, $Wald(L)$ is asymptotically chi-square distributed with
$(J-1)q$ degrees of freedom.


Classification Table 
Suppose that $c(j, j')$ is the $(j, j')$-th element of the classification table, $j, j' = 1, \ldots, J$. $c(j, j')$ is
the sum of the frequencies for the observations whose actual response category is j (as row) and
predicted response category is $j'$ (as column) respectively.
The predicted response category for subpopulation i is

\[ j^*(i): \quad \hat{\pi}_{i j^*(i)} = \max_j (\hat{\pi}_{ij}) \]

Should there be a tie, choose the category with the smallest category number.

For $j, j' = 1, \ldots, J$, $c(j, j')$ is given as

\[ c(j, j') = \sum_{i=1}^{m} n_{ij}\, I(j^*(i) = j') \]

The percentage of total correct predictions of the model is

\[ p = \left( \frac{\sum_{j=1}^{J} c(j, j)}{\sum_{j=1}^{J} \sum_{j'=1}^{J} c(j, j')} \right) 100\% \]

The percentage of correct predictions of the model for response category j is

\[ p_j = \left( \frac{c(j, j)}{\sum_{j'=1}^{J} c(j, j')} \right) 100\% \]
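The construction of the classification table can be sketched in Python as follows. The sketch assumes `counts[i][j]` is the frequency of actual category j in subpopulation i and `probs[i][j]` is the corresponding predicted probability; categories are indexed from 0, and ties are broken toward the smallest index because `list.index` returns the first maximum. The function name and return layout are hypothetical.

```python
def classification_table(counts, probs):
    """Return (table, overall percent correct, per-category percent correct)."""
    n_cat = len(counts[0])
    table = [[0] * n_cat for _ in range(n_cat)]
    for n_i, pi_i in zip(counts, probs):
        # Predicted category: largest probability, smallest index on ties.
        predicted = pi_i.index(max(pi_i))
        for actual, n_ij in enumerate(n_i):
            table[actual][predicted] += n_ij     # row = actual, column = predicted
    total = sum(sum(row) for row in table)
    overall = 100.0 * sum(table[j][j] for j in range(n_cat)) / total
    per_category = [
        100.0 * table[j][j] / sum(table[j]) if sum(table[j]) else float("nan")
        for j in range(n_cat)
    ]
    return table, overall, per_category
```

With counts `[[8, 2], [3, 7], [5, 5]]` and probabilities `[[0.7, 0.3], [0.4, 0.6], [0.5, 0.5]]`, the third subpopulation is a tie and is assigned to the first category, giving the table `[[13, 3], [7, 7]]`.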
Checking for Separation


The algorithm checks for separation in the data starting with iteration $\nu_{chksep}$ (20 by default). To
check for separation:

1. For each subpopulation i, find $j^*$ such that $\hat{\pi}_{ij^*} = \max_j (\hat{\pi}_{ij})$.

2. If $n_{ij^*} = N_i$, then there is a perfect prediction for subpopulation i.

3. If all subpopulations have perfect prediction, then there is complete separation. If some patterns
have perfect prediction and the Hessian of B is singular, then there is quasi-complete separation.
If a WEIGHT variable is specified, it is used to replicate a case as many times as indicated by
the weight value rounded to the nearest integer. If the workspace requirements are exceeded and 
sampling has been selected, a random sample of cases is chosen for analysis using the algorithm 
described in SAMPLE. For the RUNS test, if sampling is specified, it is ignored. The tests are 
described in (Siegel, 1956). 


Spearman Correlation Coefficient 


For each of the variables X and Y separately, the observations are sorted into ascending order and
replaced by their ranks. In situations where t observations are tied, the average rank is assigned.
Each time t > 1, the quantity $t^3 - t$ is calculated and summed separately for each variable. These
sums will be designated $ST_x$ and $ST_y$.

For each of the N observations, the difference between the rank of X and the rank of Y is computed as:

\[ d_i = R(X_i) - R(Y_i) \]

Spearman’s rho ($\rho$) is calculated as (Siegel, 1956):

\[ \rho = \frac{T_x + T_y - \sum_{i=1}^{N} d_i^2}{2\sqrt{T_x T_y}} \]

where

\[ T_x = \frac{N^3 - N - ST_x}{12} \quad \text{and} \quad T_y = \frac{N^3 - N - ST_y}{12} \]

If $T_x$ or $T_y$ is 0, the statistic is not computed.

The significance level is calculated assuming that, under the null hypothesis,

\[ t = \rho \sqrt{\frac{N - 2}{1 - \rho^2}} \]

is distributed as a t with $N - 2$ degrees of freedom. A one- or two-tailed significance level is
printed depending on the user-selected option.
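The tie-corrected computation can be sketched in Python as below. The helper `average_ranks` and the function `spearman_rho` are hypothetical names; the sketch assumes unweighted data and returns `None` where the text says the statistic is not computed.

```python
from math import sqrt


def average_ranks(values):
    """Return (ranks with ties averaged, sum of t**3 - t over tie groups)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sum = 0
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        t = end - pos + 1                    # size of this group of tied values
        avg = (pos + end) / 2 + 1            # average of the 1-based ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        if t > 1:
            tie_sum += t ** 3 - t
        pos = end + 1
    return ranks, tie_sum


def spearman_rho(x, y):
    """Tie-corrected Spearman rho; None if T_x or T_y is zero."""
    n = len(x)
    rx, st_x = average_ranks(x)
    ry, st_y = average_ranks(y)
    t_x = (n ** 3 - n - st_x) / 12
    t_y = (n ** 3 - n - st_y) / 12
    if t_x == 0 or t_y == 0:
        return None
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return (t_x + t_y - sum_d2) / (2 * sqrt(t_x * t_y))
```

Without ties the expression reduces to the familiar $1 - 6\sum d_i^2 / (N^3 - N)$; for X = (1, 2, 3, 4, 5) and Y = (2, 1, 4, 3, 5) both give 0.8.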


Kendall’s Tau 


For each of the variables X and Y separately, the observations are sorted into ascending order and
replaced by their ranks. In situations where t observations are tied, the average rank is assigned.

Each time t > 1, the following quantities are computed and summed over all groups of ties for
each variable separately:

\[ \tau = \sum (t^2 - t), \quad \tau' = \sum (t^2 - t)(t - 2), \quad \tau'' = \sum (t^2 - t)(2t + 5) \]

The sums for X are denoted $\tau_x$, $\tau'_x$, $\tau''_x$ and those for Y are denoted $\tau_y$, $\tau'_y$, $\tau''_y$.

Each of the N cases is compared to the others to determine with how many cases its ranking of
X and Y is concordant or discordant. The following procedure is used. For each distinct pair of
cases $(i, j)$, $i < j$, the quantity

\[ d_{ij} = [R(X_j) - R(X_i)]\,[R(Y_j) - R(Y_i)] \]

is computed. If the sign of this product is positive, the pair of observations $(i, j)$ is concordant,
since both members of observation i are either less than or greater than their respective
measurement in observation j. If the sign is negative, the pair is discordant.

The number of concordant pairs minus the number of discordant pairs is

\[ S = \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} sign(d_{ij}) \]

where $sign(d_{ij})$ is defined as +1 or -1 depending on the sign of $d_{ij}$. Pairs in which $d_{ij} = 0$
are ignored in the computation of S.

Kendall’s tau ($\tau$) is computed as

\[ \tau = \frac{2S}{\sqrt{K - \tau_x}\,\sqrt{K - \tau_y}} \]

If the denominator is 0, the statistic is not computed.

The variance of S is estimated by (Kendall, 1955):

\[ d = \frac{1}{18}\left\{ K(2N + 5) - \tau''_x - \tau''_y \right\} + \frac{\tau'_x \tau'_y}{9K(N - 2)} + \frac{\tau_x \tau_y}{2K} \]

where

\[ K = N^2 - N \]

The significance level is obtained using

\[ Z = \frac{S}{\sqrt{d}} \]

which, under the null hypothesis, is approximately normally distributed. The significance level is
either one- or two-sided, depending on the user specification.
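A compact Python sketch of the tau computation is given here, under stated assumptions: it uses the standard tie-corrected (tau-b) form with the tie sums $\sum(t^2 - t)$, $\sum(t^2 - t)(t - 2)$ and $\sum(t^2 - t)(2t + 5)$, it works on the raw values rather than on ranks (the sign of each product is the same), and the function name and return layout are hypothetical.

```python
from collections import Counter
from math import sqrt


def kendall_tau(x, y):
    """Return (S, tau, Z); tau and Z are None when the denominator is 0."""
    n = len(x)
    if n < 3:
        raise ValueError("at least 3 cases are needed for the variance formula")
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            d = (x[i] - x[j]) * (y[i] - y[j])
            s += (d > 0) - (d < 0)           # +1 concordant, -1 discordant, 0 ignored

    def tie_sums(values):
        sizes = [t for t in Counter(values).values() if t > 1]
        return (
            sum(t * t - t for t in sizes),
            sum((t * t - t) * (t - 2) for t in sizes),
            sum((t * t - t) * (2 * t + 5) for t in sizes),
        )

    tx, tx1, tx2 = tie_sums(x)
    ty, ty1, ty2 = tie_sums(y)
    k = n * n - n
    denom = sqrt(k - tx) * sqrt(k - ty)
    if denom == 0:
        return s, None, None
    tau = 2 * s / denom
    var_s = (k * (2 * n + 5) - tx2 - ty2) / 18 + tx1 * ty1 / (9 * k * (n - 2)) + tx * ty / (2 * k)
    return s, tau, s / sqrt(var_s)
```

For X = (1, 2, 3, 4, 5) and Y = (2, 1, 4, 3, 5) there are 8 concordant and 2 discordant pairs, so S = 6 and tau = 0.6.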
Nonparametric tests make minimal assumptions about the underlying distribution of the data.
The available nonparametric tests can be grouped into three broad categories based on how 
the data are organized: one-sample tests, related-samples tests, and independent-samples tests. 
A one-sample test analyzes one field. A test for related samples compares two or more fields 
for the same set of records. An independent-samples test analyzes one field that is grouped by 
categories of another field. 


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


$\{x_i, f_i\}_{i=1}^{n}$
    Data for one-sample tests: $x_i$ is the ith observed value, and $f_i$ is the
    frequency/replication weight for $x_i$.

$\{x_{i1}, \ldots, x_{iK}, f_i\}_{i=1}^{n}$
    Data for K related samples tests: each x-column represents one sample, and $f_i$ is
    the frequency/replication weight for row/record i.

$\{x_i, g_i, f_i\}_{i=1}^{n}$
    Data for K independent samples: $g_i$ indicates the sample that observation
    $x_i$ belongs to, and $f_i$ is the frequency/replication weight.

$G_j = \{i : g_i = j\}$
    All record indices in the jth sample.

$n_j$
    The number of records in the jth sample, ignoring the frequency weight.

$n_{j,f} = \sum_{i \in G_j} f_i$
    The number of records in the jth sample, incorporating the frequency weight.

$rank(g(x_i); D)$
    The rank of $g(x_i)$ when all $\{g(x) : x \in D\}$ are jointly ranked. If there are
    ties, the average rank is used.

$rank(g(x_i); D, f)$
    Like $rank(g(x_i); D)$, but frequency weights are incorporated when
    calculating the ranks.

$F_k(x)$
    The cumulative distribution function of population k.

$\Phi(z)$
    The cumulative distribution function for the standard normal distribution.

$\alpha$
    The critical value for determining whether to reject the null hypothesis.


One-Sample Tests 


The following one-sample tests are available. 


Binomial Test 


For a categorical field with 2 values (or recoded categorical field with more than 2 values or 
recoded continuous field), this tests: 


$H_0$: The probability of success is equal to the hypothesized success probability $p_0$.

$H_a$ (if $p_0 = 0.5$): The probability of success is not equal to the hypothesized success probability
(use the two-tailed p-value)

$H_a$ (if $p_0 > 0.5$): The probability of success is greater than the hypothesized success probability
(use the one-tailed p-value)

$H_a$ (if $p_0 < 0.5$): The probability of success is less than the hypothesized success probability
(use the one-tailed p-value)

Let $n_{1,f}$ and $n_{2,f}$ be the numbers of records in the success and failure categories, incorporating
the frequency weight.

If $n_{1,f} + n_{2,f} \le 25$, then the one-tailed exact probability is

\[ p_1 = \min \left\{ \Pr(T \le n_{1,f} \mid H_0),\; \Pr(T \ge n_{1,f} \mid H_0) \right\} \]

where

\[ \Pr(T \le n_{1,f} \mid H_0) = \sum_{i=0}^{n_{1,f}} \binom{n_{1,f} + n_{2,f}}{i} p_0^{\,i} (1 - p_0)^{n_{1,f} + n_{2,f} - i} \]

and

\[ \Pr(T \ge n_{1,f} \mid H_0) = \sum_{i=n_{1,f}}^{n_{1,f} + n_{2,f}} \binom{n_{1,f} + n_{2,f}}{i} p_0^{\,i} (1 - p_0)^{n_{1,f} + n_{2,f} - i} \]

The two-tailed exact probability is $p = 2p_1$.

If $n_{1,f} + n_{2,f} > 25$, a normal approximation is used. Letting

\[ Z_1 = \frac{n_{1,f} + 0.5 - (n_{1,f} + n_{2,f})\, p_0}{\sqrt{(n_{1,f} + n_{2,f})\, p_0 (1 - p_0)}} \]

and

\[ Z_2 = \frac{n_{1,f} - 0.5 - (n_{1,f} + n_{2,f})\, p_0}{\sqrt{(n_{1,f} + n_{2,f})\, p_0 (1 - p_0)}} \]

the one-tailed approximate probability is

\[ p_1 = \min \{ \Phi(Z_1),\; 1 - \Phi(Z_2) \} \]

and the two-tailed approximate probability is $p = 2p_1$.

$p < \alpha$ rejects the two-tailed test if $p_0 = 0.5$; otherwise $p_1 < \alpha$ rejects the one-tailed test.
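The two branches of the binomial test can be sketched with the Python standard library. The sketch assumes integer (weighted) counts so that the exact sums are well defined; the function name `binomial_test` and the default `p0=0.5` are assumptions for the example.

```python
from math import comb, sqrt
from statistics import NormalDist


def binomial_test(n1, n2, p0=0.5):
    """Return (one-tailed p1, two-tailed p = 2 * p1) for n1 successes, n2 failures."""
    n = n1 + n2
    if n <= 25:
        # Exact binomial tail probabilities under H0.
        def pmf(i):
            return comb(n, i) * p0 ** i * (1 - p0) ** (n - i)

        lower = sum(pmf(i) for i in range(0, n1 + 1))    # Pr(T <= n1)
        upper = sum(pmf(i) for i in range(n1, n + 1))    # Pr(T >= n1)
        p1 = min(lower, upper)
    else:
        # Normal approximation with continuity correction.
        sd = sqrt(n * p0 * (1 - p0))
        z1 = (n1 + 0.5 - n * p0) / sd
        z2 = (n1 - 0.5 - n * p0) / sd
        phi = NormalDist().cdf
        p1 = min(phi(z1), 1 - phi(z2))
    return p1, 2 * p1
```

For 3 successes and 7 failures with $p_0 = 0.5$, the exact lower tail is 176/1024, so `binomial_test(3, 7)` returns a one-tailed probability of 0.171875.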
Confidence Interval for Binomial Success Rate


Without loss of generality we assume that $x_i = 0$ or 1, with 1 representing success. We
want to estimate the success probability $p = \Pr(x = 1)$ and its confidence interval. Let
$T = \sum_{i=1}^{n} f_i\, I(x_i = 1) = \sum_{i=1}^{n} f_i x_i$ and $n_f = \sum_{i=1}^{n} f_i$. The estimate of the success probability is
$\hat{p} = T / n_f$. For the confidence interval $(p_L, p_U)$, we provide the following three ways of calculating it.

For all three methods, $p_L = 0$ if $\hat{p} = 0$ and $p_U = 1$ if $\hat{p} = 1$.


Clopper-Pearson confidence interval 


The Clopper-Pearson confidence interval is an exact confidence interval based on inverting the
exact equal-tailed binomial test $H_0: p = p_0$. The lower and upper confidence limits are found
by solving

\[ \sum_{i=T}^{n_f} \binom{n_f}{i} p_L^{\,i} (1 - p_L)^{n_f - i} = \alpha/2 \]

\[ \sum_{i=0}^{T} \binom{n_f}{i} p_U^{\,i} (1 - p_U)^{n_f - i} = \alpha/2 \]

The solutions to these two equations are (Leemis and Trivedi, 1996)

\[ p_L = \left[ 1 + \frac{n_f - T + 1}{T\, F_{\alpha/2}(2T,\, 2(n_f - T + 1))} \right]^{-1} \]

\[ p_U = \left[ 1 + \frac{n_f - T}{(T + 1)\, F_{1-\alpha/2}(2(T + 1),\, 2(n_f - T))} \right]^{-1} \]

where $F_{\alpha/2}(v_1, v_2)$ is the $\alpha/2$ percentile of the F-distribution $F(v_1, v_2)$.

Note: The Clopper-Pearson confidence interval is conservative (coverage probability is at least
$1 - \alpha$) because of the discreteness of the binomial distribution. The coverage probability can be
much larger than $1 - \alpha$ unless the sample size is very large.


Jeffreys confidence interval 


Jeffreys confidence interval is a Bayesian interval based on the posterior probability of p using the
Jeffreys prior $Beta(\frac{1}{2}, \frac{1}{2})$. The resulting posterior for p is $Beta(T + \frac{1}{2},\, n_f - T + \frac{1}{2})$. Then the
lower and upper confidence limits of p are the $\alpha/2$ and $1 - \alpha/2$ percentiles of this beta distribution:

\[ p_L = B_{\alpha/2}\left( T + \tfrac{1}{2},\; n_f - T + \tfrac{1}{2} \right) \]

\[ p_U = B_{1-\alpha/2}\left( T + \tfrac{1}{2},\; n_f - T + \tfrac{1}{2} \right) \]
Likelihood ratio confidence interval


The likelihood ratio confidence interval is constructed by inverting the acceptance region of the
likelihood ratio test, which accepts the null hypothesis $H_0: p = p_0$ if

\[ -2 \log\left( \frac{Lik(p_0)}{Lik(\hat{p})} \right) \le \chi^2_{1-\alpha}(1) \]

or

\[ -2\,\left( l(p_0) - l(\hat{p}) \right) \le \chi^2_{1-\alpha}(1) \]

where $\chi^2_{1-\alpha}(1)$ is the $1 - \alpha$ percentile of the chi-square distribution with 1 degree of freedom, and
$Lik(p)$ and $l(p)$ are the likelihood and log likelihood functions. The log likelihood function is

\[ l(p) = \log(Lik(p)) = T \log p + (n_f - T) \log(1 - p) \]

with the convention $0 \log(0) = 0$.

Inverting the likelihood ratio test means finding a range $(p_L, p_U)$ for $p_0$ such that within this range
$-2(l(p_0) - l(\hat{p})) \le \chi^2_{1-\alpha}(1)$, or equivalently $l(p_0) - l(\hat{p}) + \frac{1}{2}\chi^2_{1-\alpha}(1) \ge 0$, is satisfied. The function
$h(p_0) = l(p_0) - l(\hat{p}) + \frac{1}{2}\chi^2_{1-\alpha}(1)$ is well behaved, with $h(0) = -\infty$, $h(1) = -\infty$, and maximum at
$h(\hat{p}) = \frac{1}{2}\chi^2_{1-\alpha}(1) > 0$. It is increasing for $p_0 < \hat{p}$ and decreasing for $p_0 > \hat{p}$ because its first derivative is

\[ h'(p_0) = \frac{T - n_f p_0}{p_0 (1 - p_0)} = \frac{n_f (\hat{p} - p_0)}{p_0 (1 - p_0)} \begin{cases} > 0 & p_0 < \hat{p} \\ = 0 & p_0 = \hat{p} \\ < 0 & p_0 > \hat{p} \end{cases} \]

The two solutions of $h(p_0) = 0$, one on each side of $\hat{p}$, correspond to $p_L$ ($< \hat{p}$) and $p_U$ ($> \hat{p}$). To
obtain the solutions, the Newton-Raphson iterative method is used to solve the equation $h(p_0) = 0$.
Letting $p^{(\nu)}$ be the solution at iteration step $\nu$, the solution $p^{(\nu+1)}$ at iteration step $\nu + 1$ is updated as

\[ p^{(\nu+1)} = p^{(\nu)} - \xi\, \frac{h(p^{(\nu)})}{h'(p^{(\nu)})} = p^{(\nu)} - \xi\, \frac{p^{(\nu)} (1 - p^{(\nu)})}{T - n_f\, p^{(\nu)}}\, h(p^{(\nu)}) \]

The stepping scalar $\xi > 0$ is used to make sure that $|h(p^{(\nu+1)})| < |h(p^{(\nu)})|$ and $0 < p^{(\nu+1)} < 1$. We
use the step-halving method if either $|h(p^{(\nu+1)})| < |h(p^{(\nu)})|$ or $0 < p^{(\nu+1)} < 1$ is not satisfied.
Let s be the maximum number of steps in step-halving; the set of values of $\xi$ is $\{1/2^r : r = 0, \ldots, s - 1\}$.

Iterations start with an initial value $p^{(0)}$ and continue until one of the stopping criteria is reached.

Only $p_U$ needs to be calculated when $\hat{p} = 0$ because $p_L = 0$, and only $p_L$ needs to be calculated
when $\hat{p} = 1$ because $p_U = 1$. In fact, a closed form solution exists in these special situations:
$p_U = 1 - \exp\left( -\chi^2_{1-\alpha}(1) / (2 n_f) \right)$ for $\hat{p} = 0$, and $p_L = \exp\left( -\chi^2_{1-\alpha}(1) / (2 n_f) \right)$ for $\hat{p} = 1$.
Initial values. Any initial value $p^{(0)} \in (0, \hat{p})$ will lead to a solution for $p_L$, and any initial value
$p^{(0)} \in (\hat{p}, 1)$ will lead to $p_U$. Let $p_{L,J}$ and $p_{U,J}$ be the Jeffreys lower and upper confidence limits. We
will take the following as the initial values for the lower and upper confidence limits:

\[ p_L^{(0)} = \begin{cases} p_{L,J} & \text{if } p_{L,J} < \hat{p} \\ \hat{p}/2 & \text{otherwise} \end{cases} \]

\[ p_U^{(0)} = \begin{cases} p_{U,J} & \text{if } p_{U,J} > \hat{p} \\ (\hat{p} + 1)/2 & \text{otherwise} \end{cases} \]

Stopping criteria. Let $\epsilon = 10^{-8}$. The following stopping criteria are checked in the following order.

1. Absolute argument convergence criterion: $|p^{(\nu+1)} - p^{(\nu)}| < \epsilon$

2. Relative argument convergence criterion:

3. Function convergence criterion: $|h(p^{(\nu+1)})| < \epsilon$

4. The maximum number of iterations (default 50) is reached, or the maximum number of steps in
step-halving (default 20) is reached.
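The Newton-Raphson iteration with step-halving can be sketched in Python as follows. This is an illustration under several assumptions rather than the procedure's own implementation: it always starts from $\hat{p}/2$ and $(\hat{p}+1)/2$ instead of the Jeffreys limits, it checks only the absolute-argument and function criteria, and it obtains the 1-degree-of-freedom chi-square percentile as the square of the standard normal $1 - \alpha/2$ percentile. The function name `lr_interval` and its defaults are hypothetical.

```python
from math import exp, log
from statistics import NormalDist


def lr_interval(T, n, alpha=0.05, eps=1e-8, max_iter=50, max_halvings=20):
    """Likelihood ratio confidence limits (p_L, p_U) for T successes out of n."""
    # chi-square(1) percentile at 1 - alpha equals the squared normal percentile at 1 - alpha/2.
    chi2 = NormalDist().inv_cdf(1 - alpha / 2) ** 2
    p_hat = T / n

    def loglik(p):
        # Convention 0 * log(0) = 0: skip terms whose count is zero.
        return (T * log(p) if T > 0 else 0.0) + ((n - T) * log(1 - p) if n > T else 0.0)

    l_hat = loglik(p_hat)

    def h(p):
        return loglik(p) - l_hat + chi2 / 2

    def h_prime(p):
        return (T - n * p) / (p * (1 - p))

    def solve(p):
        for _ in range(max_iter):
            hp = h(p)
            if abs(hp) < eps:                        # function convergence
                return p
            step = hp / h_prime(p)
            for r in range(max_halvings):            # step-halving: xi = 1 / 2**r
                cand = p - step / 2 ** r
                if 0 < cand < 1 and abs(h(cand)) < abs(hp):
                    break
            else:
                return p                             # step-halving limit reached
            if abs(cand - p) < eps:                  # absolute argument convergence
                return cand
            p = cand
        return p

    # Closed forms for the boundary cases, Newton-Raphson otherwise.
    if T == 0:
        return 0.0, 1 - exp(-chi2 / (2 * n))
    if T == n:
        return exp(-chi2 / (2 * n)), 1.0
    return solve(p_hat / 2), solve((p_hat + 1) / 2)
```

For example, `lr_interval(5, 20)` returns limits on either side of $\hat{p} = 0.25$ at which $h$ is numerically zero.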


Chi-Square Test 
For a categorical field, this tests: 
$H_0$: The probability of each category i equals the hypothesized probability $P_i$.
$H_a$: At least one category’s probability does not equal its hypothesized probability.

The test statistic is

\[ \chi^2 = \sum_{i=1}^{k} \frac{(O_i - EXP_i)^2}{EXP_i} \]

where $O_i$ and $EXP_i = P_i n_f$ are the observed and expected frequencies of category i.

The one-sided p-value is

\[ p = \Pr(\chi^2_{k-1} \ge \chi^2) = 1 - \Pr(\chi^2_{k-1} < \chi^2) \]

where $\chi^2_{k-1}$ follows a chi-square distribution with $k - 1$ degrees of freedom.

$p < \alpha$ rejects the null hypothesis.
Kolmogorov-Smirnov Test
For a continuous field, this tests:
$H_0$: $F(x) = F_0(x)$ for all x, where $F(x)$ is the distribution of the sample and $F_0(x)$ is the
hypothesized distribution, which can be the uniform, the Poisson, the normal or the exponential
distribution.

$H_a$: $F(x) \neq F_0(x)$ for some x.


Empirical cumulative distribution function


The observations are sorted into ascending order: $x_{(1)} < x_{(2)} < \cdots < x_{(m)}$, where m is the number
of distinct values of X. Then the empirical cdf is


Theoretical cumulative distribution function 


Uniform

\[ F_0(x_{(i)}) = \frac{x_{(i)} - \min}{\max - \min} \]

where min and max are user-specified (default sample minimum and maximum).


Poisson

\[ F_0(x_{(i)}) = \sum_{l=0}^{x_{(i)}} \frac{e^{-\lambda} \lambda^{l}}{l!} \]

where $\lambda$ is user-specified (default sample mean). If $\lambda > 100{,}000$, the normal approximation is
used with $\mu = \lambda$ and $\sigma = \sqrt{\lambda}$.


Normal

\[ F_0(x_{(i)}) = \Phi\left( \frac{x_{(i)} - \mu}{\sigma} \right) \]

where $\mu$ and $\sigma$ are user-specified (default sample mean and standard deviation).


Exponential

\[ F_0(x_{(i)}) = 1 - e^{-\beta x_{(i)}} \]

where $\beta$ is user-specified (default inverse sample mean).


Test statistic and p-value 


The test statistic is calculated based on differences between the empirical cumulative distribution 
and the theoretical cumulative distribution. For the uniform, normal and exponential distributions, 
two differences are computed: 


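\[ D_i = \hat{F}(x_{(i-1)}) - F_0(x_{(i)}) \]
\[ \tilde{D}_i = \hat{F}(x_{(i)}) - F_0(x_{(i)}) \]

for i=1,...,m. For the Poisson:

\[ D_i = \hat{F}(x_{(i-1)}) - F_0(x_{(i)} - 1), \quad x_{(i)} > 0, \]

for i=1,...,m

The test statistic is

\[ Z = \sqrt{n_f}\; \max_i \left( |D_i|, |\tilde{D}_i| \right) \]

The two-tailed probability level is estimated using the first three terms of the Smirnov (1948)
formula.

\[ p = \begin{cases} 1 & 0 \le Z < 0.27 \\ 1 - \dfrac{2.506628}{Z}\left( Q + Q^{9} + Q^{25} \right),\; Q = e^{-1.233701\, Z^{-2}} & 0.27 \le Z < 1 \\ 2\left( Q - Q^{4} + Q^{9} - Q^{16} \right),\; Q = e^{-2 Z^{2}} & 1 \le Z < 3.1 \\ 0 & Z \ge 3.1 \end{cases} \]

$p < \alpha$ rejects the null hypothesis.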


Note: If the distribution is normal and parameters are estimated from the data, then the Lilliefors 
method is used to compute the test statistic and p value instead of the method described in this 
section. For more information, see the topic “Kolmogorov-Smirnov Statistic with Lilliefors’ 
Significance”. 
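As an illustration of the piecewise Smirnov approximation, the sketch below evaluates the two-tailed probability for a given value of the scaled statistic Z. The constants are assumptions in the sense that they are the standard values $\sqrt{2\pi} \approx 2.506628$ and $\pi^2/8 \approx 1.233701$ from the two series for the Kolmogorov distribution; the function name is hypothetical.

```python
from math import exp


def smirnov_p(z):
    """Approximate two-tailed probability for the scaled statistic Z."""
    if z < 0.27:
        return 1.0
    if z < 1.0:
        # Small-Z series: 1 - (sqrt(2*pi) / Z) * (Q + Q^9 + Q^25), Q = exp(-(pi^2 / 8) / Z^2).
        q = exp(-1.233701 / (z * z))
        return 1.0 - (2.506628 / z) * (q + q ** 9 + q ** 25)
    if z < 3.1:
        # Large-Z series: 2 * (Q - Q^4 + Q^9 - Q^16), Q = exp(-2 * Z^2).
        q = exp(-2.0 * z * z)
        return 2.0 * (q - q ** 4 + q ** 9 - q ** 16)
    return 0.0
```

The two branches agree closely where they meet: at Z = 1 both series give a probability of about 0.27.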
Kolmogorov-Smirnov Statistic with Lilliefors’ Significance
In the case that the distribution is normal and parameters are estimated from the data, the
Lilliefors method (Lilliefors, 1967) is used to compute the test statistic and p value. The Lilliefors
significance p is calculated based on the formulas and critical value tables from Lilliefors
(Lilliefors et al., 1967) and Dallal and Wilkinson (Dallal and Wilkinson, 1986).

The test statistic is

\[ D = \max_i \left( |D_i|, |\tilde{D}_i| \right) \]

Let $n_f = \sum_{i=1}^{n} f_i$. If $n_f < 5$, the p value is set to the system missing value.

If $n_f \ge 5$, then the Lilliefors significance p is calculated as follows:

Step 1. Compute the critical value $D_{0.1}$ for upper tail probability 0.1:

\[ D_{0.1} = \frac{-b - \sqrt{b^2 - 4ac}}{2a} \]

where, if $n_f \le 100$, then $a = -7.01256\,(n_f + 2.78019)$, $b = 2.99587 \sqrt{n_f + 2.78019}$, and

\[ c = 2.1804661 + \frac{0.974598}{\sqrt{n_f}} + \frac{1.67997}{n_f} \]

and if $n_f > 100$, then $a = -7.90289126054 \cdot n_f^{0.98}$, $b = 3.180370175721 \cdot n_f^{0.49}$, and
$c = 2.2947256$.

Step 2. If $D = D_{0.1}$, then $p = 0.1$.
If $D > D_{0.1}$, then $p = \exp\{ aD^2 + bD + c - 2.3025851 \}$.
Otherwise go to step 3.

Step 3. If there is an entry in Table 69-1 for sample size $n_f$, then go to step 4 to compute the p value. Otherwise, linear interpolation is used to calculate the critical values for a sample size of $n_f$.

For example, for $n_f = 22$, which is between sample sizes $s_1 = 20$ and $s_2 = 25$ with critical values $c_1 = 0.159$ and $c_2 = 0.143$ respectively, the critical value for upper tail probability 0.2 is computed as:

\[ \frac{c_2 - c_1}{s_2 - s_1}(n_f - s_1) + c_1 = \frac{0.143 - 0.159}{25 - 20}(22 - 20) + 0.159 = 0.1526 \]

The critical value for upper tail probability 0.15 (for $n_f = 22$) is computed in a similar manner.
Step 4. If $D_{0.15} \le D < D_{0.1}$ or $D_{0.2} \le D < D_{0.15}$, then linear interpolation is used to compute the p value, where $D_{0.2}$ and $D_{0.15}$ are the critical values for upper tail probability 0.2 and probability 0.15 (corresponding to sample size $n_f$) respectively.

For example, for $n_f = 20$, use step 1 to compute the critical value for upper tail probability 0.1 as $D_{0.1} = 0.1772025$. If $D = 0.170$, then $0.166 = D_{0.15} < D < D_{0.1}$. The p value can be computed as:

\[ p = \frac{0.1 - 0.15}{D_{0.1} - D_{0.15}}(D - D_{0.15}) + 0.15 = 0.1321469 \]

If $D < D_{0.2}$, then p is reported as > 0.2.
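
The four steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the tabulated critical values for probabilities 0.2 and 0.15 are passed in by the caller, the boundary $n_f = 100$ is assumed to use the first set of coefficients, and interpolation between $D_{0.2}$ and $D_{0.15}$ is assumed to follow the same pattern as the worked example.

```python
import math

LN10 = 2.3025851  # constant subtracted in step 2 (approximately ln 10)

def coefficients(n_f):
    """Return (a, b, c) from step 1 for weighted sample size n_f."""
    if n_f <= 100:  # assumption: n_f == 100 uses this branch
        a = -7.01256 * (n_f + 2.78019)
        b = 2.99587 * math.sqrt(n_f + 2.78019)
        c = 2.1804661 + 0.974598 / math.sqrt(n_f) + 1.67997 / n_f
    else:
        a = -7.90289126054 * n_f ** 0.98
        b = 3.180370175721 * n_f ** 0.49
        c = 2.2947256
    return a, b, c

def critical_value_010(n_f):
    """Critical value D_0.1: the positive root of a*D^2 + b*D + c = 0."""
    a, b, c = coefficients(n_f)
    return (-b - math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)

def lilliefors_p(d, n_f, d_020, d_015):
    """Return (p, is_lower_bound); p is None when n_f < 5."""
    if n_f < 5:
        return None, False
    a, b, c = coefficients(n_f)
    d_010 = critical_value_010(n_f)
    if d >= d_010:  # step 2
        return math.exp(a * d * d + b * d + c - LN10), False
    if d >= d_015:  # step 4, between probabilities 0.15 and 0.1
        return (0.10 - 0.15) / (d_010 - d_015) * (d - d_015) + 0.15, False
    if d >= d_020:  # step 4, between probabilities 0.2 and 0.15
        return (0.15 - 0.20) / (d_015 - d_020) * (d - d_020) + 0.20, False
    return 0.2, True  # reported as "> 0.2"
```

Calling `lilliefors_p(0.170, 20, 0.159, 0.166)` follows the worked example for a sample size of 20.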


Table 69-1 
Upper tail probability and corresponding critical values 


p value 0.2 p value 0.15 
028 0308 


0.269 0.281 


0.252 0.264 


0.239 0.250 


ser 0238 
10 


0.217 0.228 


0.208 0.218 


0.200 0.210 


0193 0202 
14 


a784 
0.179 
0.175 
0.170 
0.166 
0.150 
0.138 


Runs Test 


For a categorical field with 2 values (or a recoded categorical field with more than 2 values or a 
recoded continuous field), this tests: 


$H_0$: The observed order of observations of a field is attributable to chance variation.

Number of runs

The number of times that the category changes (that is, where $x_i$ belongs to one category and $x_{i+1}$ belongs to the other), as well as the number of records in category 1 ($n_{1,f}$) and category 2 ($n_{2,f}$), are determined. The number of runs, $R$, is the number of category changes plus one.


Test statistic and p-value 


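The sampling distribution of the number of runs is approximately normal with

\[ \mu_R = \frac{2 n_{1,f} n_{2,f}}{n_{1,f} + n_{2,f}} + 1 \]

\[ \sigma_R = \sqrt{\frac{2 n_{1,f} n_{2,f}\,(2 n_{1,f} n_{2,f} - n_{1,f} - n_{2,f})}{(n_{1,f} + n_{2,f})^2\,(n_{1,f} + n_{2,f} - 1)}} \]

The test statistic is $z = (R - \mu_R)/\sigma_R$ if $n_f \ge 50$, where $n_f = n_{1,f} + n_{2,f}$; otherwise

\[
z_c = \begin{cases}
(R - \mu_R + 0.5)/\sigma_R & \text{if } R - \mu_R \le -0.5 \\
(R - \mu_R - 0.5)/\sigma_R & \text{if } R - \mu_R \ge 0.5 \\
0 & \text{if } |R - \mu_R| < 0.5
\end{cases}
\]

The one sided p-value is $p_1 = \Pr(Z \ge |z|) = 1 - \Phi(|z|)$ and the two sided p-value is $p_2 = 2p_1$.

$p_2 < \alpha$ rejects the null hypothesis.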
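
A minimal Python sketch of the runs test follows. It assumes unit frequency weights, exactly two categories that both occur, and a hypothetical function name; the threshold of 50 is applied to the total number of records.

```python
import math

def runs_test(categories):
    """Return (runs, z, p_one_sided, p_two_sided) for a two-category sequence."""
    first = categories[0]
    n1 = sum(1 for c in categories if c == first)
    n2 = len(categories) - n1
    # Number of runs = number of category changes plus one.
    runs = 1 + sum(1 for a, b in zip(categories, categories[1:]) if a != b)
    n = n1 + n2
    mu = 2.0 * n1 * n2 / n + 1.0
    sigma = math.sqrt(2.0 * n1 * n2 * (2.0 * n1 * n2 - n1 - n2)
                      / (n * n * (n - 1.0)))
    diff = runs - mu
    if n >= 50:
        z = diff / sigma
    elif diff <= -0.5:
        z = (diff + 0.5) / sigma   # continuity correction toward zero
    elif diff >= 0.5:
        z = (diff - 0.5) / sigma
    else:
        z = 0.0
    p1 = 1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))
    return runs, z, p1, 2.0 * p1
```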


Wilcoxon Signed-Rank Test 
For a continuous field, this tests: 
$H_0$: median$(X) = \theta$, where $\theta$ is user-specified (default: the sample median).

Let $d_i = x_i - \theta$ and $D = \{d_i : |d_i| \neq 0\}$. The test statistic is the sum of positive ranks incorporating the frequency weights:

\[ T = \sum_{i \in D} f_i\, \mathrm{rank}(|d_i|; D, f)\, I(\mathrm{sgn}(d_i) > 0) \]

The standardized test statistic is

\[ T^* = \frac{T - \mu_T}{\sigma_T} \]

where

\[ \mu_T = \frac{1}{4} n_f (n_f + 1) \]

\[ \sigma_T^2 = \frac{1}{24} n_f (n_f + 1)(2 n_f + 1) - \frac{1}{48} \sum_{j=1}^{M} \left( t_{j,f}^3 - t_{j,f} \right) \]

\[ n_f = \sum_{i \in D} f_i \]

where M is the total number of distinct rank values of $|d_i| > 0$ and $t_{j,f}$ is the number of records tied at the jth distinct value, incorporating the frequency weights.

The asymptotic one-sided and two-sided p-values are

\[ p_1 = \Pr(Z \ge |T^*|) = 1 - \Phi(|T^*|) \]

\[ p = 2p_1 \]

$p_1 < \alpha$ rejects the null hypothesis in favor of median$(X) > \theta$ if $T^* > 0$ and in favor of median$(X) < \theta$ if $T^* < 0$.


Note: The one-sample Wilcoxon signed-rank test is equivalent to the matched-pairs Wilcoxon signed-rank test when the second sample is replaced by the constant $\theta$.


Independent Samples Tests 


The following independent-samples tests are available. 


Mann-Whitney Test 


For two independent samples from a continuous field, this tests: 


$H_0$: $F_1(x) = F_2(x)$; that is, the two samples are from populations with the same distribution function

$H_A$: $F_1(x) > F_2(x)$

$H'_A$: $F_1(x) < F_2(x)$


The first group is defined by the first value of the grouping field in ascending order. 


Calculation of Sums of Ranks 


The combined data from both specified groups are sorted and ranks assigned to all records, with average rank being used in the case of ties. The sum of ranks for each group is

\[ S_{1,f} = \sum_{i \in G_1} f_i\, \mathrm{rank}(x_i; D_1 \cup D_2, f)\, I(x_i \in D_1) \]

\[ S_{2,f} = \sum_{i \in G_2} f_i\, \mathrm{rank}(x_i; D_1 \cup D_2, f)\, I(x_i \in D_2) \]

The average rank for each group is

\[ \bar{S}_i = S_{i,f} / n_{i,f} \]

where $n_{i,f} = \sum_{j \in G_i} f_j\, I(x_j \in D_i)$.

If there are tied records, the number of records tied at the jth distinct value incorporating the frequency weight, $t_{j,f}$, is counted.


Test statistic and p-value 


The Wilcoxon rank sum W statistic is $W = S_{2,f}$. The Mann-Whitney U statistic for group 1 is

\[ U = \sum_{i \in G_1,\, j \in G_2} f_i f_j\, I(x_i < x_j) + \frac{1}{2} \sum_{i \in G_1,\, j \in G_2} f_i f_j\, I(x_i = x_j) = n_{1,f} n_{2,f} + \frac{n_{1,f}(n_{1,f} + 1)}{2} - S_{1,f} \]

If $n_{1,f} n_{2,f} \le 400$ and $n_{1,f} n_{2,f}/2 + \min(n_{1,f}, n_{2,f}) \le 220$, the exact significance level is based on an algorithm of Dineen and Blakesley (1973), which is given as follows:


Let $f_{i,j}(u)$ be the sampling frequency of the Mann-Whitney statistic for a value of U with sample sizes i and j. Then the frequency distribution of the Mann-Whitney U statistic can be derived by summing two lower order distributions:

\[ f_{n_{1,f},\, n_{2,f}}(u) = f_{n_{1,f}-1,\, n_{2,f}}(u - n_{2,f}) + f_{n_{1,f},\, n_{2,f}-1}(u) \]

Each of the lower order distributions is symmetrical about a different value of U, and the sum gives a result that is also symmetrical. The algorithm starts with the known distribution for i=1 (or j=1) and then uses the above equation and symmetry properties to derive the full distribution for i=2 (or j=2). This procedure is repeated until the distribution for the required value $i = n_{1,f}$ (or $j = n_{2,f}$) is obtained.
After the complete distribution of U is obtained, the one sided and two sided p-values are 


id facia (U) Oa 


PI = J 2 a - : sof 
a oe a Me 


\[ p = 2p_1 \]

where $\lfloor x \rfloor$ is the floor integer of x.
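
The recurrence for the frequency distribution of U can be illustrated with a short Python sketch. It uses memoized recursion rather than the iterative build-up with symmetry that the text describes, so it is a stand-in for the idea and not the Dineen and Blakesley procedure itself; integer sample sizes and the function names are assumptions.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def u_frequency(u, n1, n2):
    """Number of arrangements of n1 + n2 ranks giving Mann-Whitney count u."""
    if u < 0:
        return 0
    if n1 == 0 or n2 == 0:
        return 1 if u == 0 else 0
    # Sum of the two lower order distributions.
    return u_frequency(u - n2, n1 - 1, n2) + u_frequency(u, n1, n2 - 1)

def u_distribution(n1, n2):
    """Frequencies for u = 0, 1, ..., n1 * n2."""
    return [u_frequency(u, n1, n2) for u in range(n1 * n2 + 1)]
```

Dividing each frequency by `comb(n1 + n2, n1)` turns the counts into probabilities.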


The test statistic corrected for ties is

\[ T = \frac{U - \mu_T}{\sigma_T} \]

where

\[ \mu_T = \frac{n_{1,f} n_{2,f}}{2} \]

\[ \sigma_T^2 = \frac{n_{1,f} n_{2,f}}{N_f (N_f - 1)} \left( \frac{N_f^3 - N_f}{12} - \sum_{j=1}^{M} \frac{t_{j,f}^3 - t_{j,f}}{12} \right), \qquad N_f = n_{1,f} + n_{2,f} \]

and M is the total number of distinct rank values. The one sided and two sided p-values are respectively

\[ p_1 = \Pr(Z \ge |T|) = 1 - \Phi(|T|) \]

\[ p = 2p_1 \]


$p_1 < \alpha$ will reject the null hypothesis in favor of $H'_A$ ($F_1(x) < F_2(x)$) if $T < 0$ and in favor of $H_A$ ($F_1(x) > F_2(x)$) if $T > 0$. $p < \alpha$ will reject the null hypothesis in favor of either $H_A$ or $H'_A$.


Wald-Wolfowitz Test 
For two independent samples from a continuous field, this tests: 


$H_0$: $F_1(x) = F_2(x)$; that is, the two samples are from populations with the same distribution function

$H_A$: $F_1(x) \neq F_2(x)$ for some x

Calculation of Number of Runs 


All observations from the two groups $G_1$ and $G_2$ are pooled and sorted into ascending order.
The number of changes in the group corresponding to the ordered data is counted. The number of 
runs (R) is the number of group changes plus one. If there are ties involving observations from the 
two groups, both the minimum and maximum numbers of runs possible are calculated. 


Suppose that m distinct values in groups $G_1$ and $G_2$ are sorted into ascending order:

\[ x_{(1)} \le x_{(2)} \le \cdots \le x_{(m)} \]

Let $s_{i,f}$ and $t_{i,f}$ be the numbers of records of $x_{(i)}$ in $G_1$ and $G_2$ respectively, incorporating the frequency weight:

\[ s_{i,f} = \sum_{j \in G_1} f_j\, I(x_j = x_{(i)}) \]

and

\[ t_{i,f} = \sum_{j \in G_2} f_j\, I(x_j = x_{(i)}) \]


Let MinRun and MaxRun be the minimum and maximum number of runs respectively, $g_1$ be the group indicator at the last run when computing the maximum number of runs, and $g_2$ be the group indicator when computing the minimum number of runs. Then the following algorithm will compute the minimum and maximum number of runs.

1. MinRun = 0, MaxRun = 0, $g_1 = 0$, $g_2 = 0$, $d = 0$, and $i = 0$.

2. $i = i + 1$. If $i > m$, stop and output MinRun and MaxRun.

3. $d = s_{i,f} - t_{i,f}$, Minim $= \min(s_{i,f}, t_{i,f})$. If Minim = 0, then go to step 6.

4. MaxRun = MaxRun + 2 Minim, MinRun = max(MinRun + 1, 2).

5. If $d \neq 0$ and $d \times g_1 \le 0$, then MaxRun = MaxRun + 1. If $d \neq 0$, then $g_1 = d$. $g_2 = -g_2$. Go to step 2.

6. If $g_2 \times d < 0$ or $i = 1$, then MinRun = MinRun + 1. If $g_1 \times d \le 0$, then MaxRun = MaxRun + 1. $g_2 = d$, $g_1 = d$. Go to step 2.
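
The six steps can be sketched in Python as a single loop. This is an illustrative reading of the algorithm, with a hypothetical function name; it assumes the MaxRun comparisons are "less than or equal to zero", so that the first run is counted while $g_1$ is still 0.

```python
def min_max_runs(s, t):
    """s[i], t[i]: weighted counts of the i-th distinct value in G1 and G2."""
    min_run = max_run = 0
    g1 = g2 = 0
    for i, (s_i, t_i) in enumerate(zip(s, t), start=1):
        d = s_i - t_i
        minim = min(s_i, t_i)
        if minim > 0:
            # Steps 4-5: the value is tied across both groups.
            max_run += 2 * minim
            min_run = max(min_run + 1, 2)
            if d != 0 and d * g1 <= 0:
                max_run += 1
            if d != 0:
                g1 = d
            g2 = -g2
        else:
            # Step 6: the value occurs in one group only.
            if g2 * d < 0 or i == 1:
                min_run += 1
            if g1 * d <= 0:
                max_run += 1
            g1 = g2 = d
    return min_run, max_run
```

For example, one value from $G_1$, one value shared by both groups, and one value from $G_2$ can be ordered as AABB (2 runs) or ABAB (4 runs).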


Test statistic and p-value 


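Let $n_{1,f} = \sum_{i \in G_1} f_i$ and $n_{2,f} = \sum_{i \in G_2} f_i$. The distribution of the number of runs, R, is approximately normal with

\[ \mu_R = \frac{2 n_{1,f} n_{2,f}}{n_{1,f} + n_{2,f}} + 1 \]

\[ \sigma_R = \sqrt{\frac{2 n_{1,f} n_{2,f}\,(2 n_{1,f} n_{2,f} - n_{1,f} - n_{2,f})}{(n_{1,f} + n_{2,f})^2\,(n_{1,f} + n_{2,f} - 1)}} \]

The test statistic is $z = (R - \mu_R)/\sigma_R$ if $n_{1,f} + n_{2,f} \ge 50$. Otherwise

\[
z_c = \begin{cases}
(R - \mu_R + 0.5)/\sigma_R & \text{if } |R - \mu_R| \ge 0.5 \\
0 & \text{if } |R - \mu_R| < 0.5
\end{cases}
\]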
The one sided p-value is $p_1 = \Pr(Z \le z) = \Phi(z)$ or $p_1 = \Pr(Z \le z_c) = \Phi(z_c)$, but if $n_{1,f} + n_{2,f} \le 30$ the following exact method is used to compute the one sided p-value, and the approximate normal method above is not used even if the test statistic was computed. The one-sided exact p-value is calculated from

\[ p_1 = \Pr(r \le R) = \sum_{r=2}^{R} f_R(r) \]

where

\[ f_R(r) = \frac{2 \dbinom{n_{1,f} - 1}{r/2 - 1} \dbinom{n_{2,f} - 1}{r/2 - 1}}{\dbinom{n_{1,f} + n_{2,f}}{n_{1,f}}} \]

when r is even, and

\[ f_R(r) = \frac{\dbinom{n_{1,f} - 1}{(r-1)/2} \dbinom{n_{2,f} - 1}{(r-3)/2} + \dbinom{n_{1,f} - 1}{(r-3)/2} \dbinom{n_{2,f} - 1}{(r-1)/2}}{\dbinom{n_{1,f} + n_{2,f}}{n_{1,f}}} \]

when r is odd.

The conservative decision is made using the largest number of runs. $p_1 < \alpha$ will reject the null hypothesis.
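
The exact distribution of the number of runs can be sketched in Python with binomial coefficients. The function names are hypothetical and integer group sizes are assumed.

```python
from math import comb

def runs_pmf(r, n1, n2):
    """Probability of exactly r runs for group sizes n1 and n2 (r >= 2)."""
    total = comb(n1 + n2, n1)
    if r % 2 == 0:
        k = r // 2 - 1
        return 2 * comb(n1 - 1, k) * comb(n2 - 1, k) / total
    hi, lo = (r - 1) // 2, (r - 3) // 2
    return (comb(n1 - 1, hi) * comb(n2 - 1, lo)
            + comb(n1 - 1, lo) * comb(n2 - 1, hi)) / total

def exact_one_sided_p(runs, n1, n2):
    """Pr(r <= runs) under the null hypothesis."""
    return sum(runs_pmf(r, n1, n2) for r in range(2, runs + 1))
```

With two records in each group, the six possible orderings give 2, 3 or 4 runs with equal probability, so `exact_one_sided_p(3, 2, 2)` is 2/3.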


Kolmogorov-Smirnov Test 
For two independent samples from a continuous field, this tests: 


$H_0$: $F_1(x) = F_2(x)$; that is, the two samples are from populations with the same distribution function

$H_A$: $F_1(x) \neq F_2(x)$ for some x

Calculation of the empirical cumulative distribution functions and differences 


For each of the two groups, distinct values are sorted into ascending order:

Group 1: $x^*_{(1)} < x^*_{(2)} < \cdots < x^*_{(n_1)}$

where $x^*_{(i)} \in D_1$ and $n_1$ is the number of distinct values in $G_1$,

Group 2: $x^{**}_{(1)} < x^{**}_{(2)} < \cdots < x^{**}_{(n_2)}$

where $x^{**}_{(i)} \in D_2$ and $n_2$ is the number of distinct values in $G_2$.
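Then the empirical cumulative distribution functions for Group 1 and Group 2 are computed as:

\[
\hat{F}_1(x) = \begin{cases}
0 & -\infty < x < x^*_{(1)} \\
\dfrac{\sum_{j \in G_1} f_j\, I(x_j \le x^*_{(i)})}{n_{1,f}} & x^*_{(i)} \le x < x^*_{(i+1)} \\
1 & x^*_{(n_1)} \le x < \infty
\end{cases}
\]

and

\[
\hat{F}_2(x) = \begin{cases}
0 & -\infty < x < x^{**}_{(1)} \\
\dfrac{\sum_{j \in G_2} f_j\, I(x_j \le x^{**}_{(i)})}{n_{2,f}} & x^{**}_{(i)} \le x < x^{**}_{(i+1)} \\
1 & x^{**}_{(n_2)} \le x < \infty
\end{cases}
\]

where $n_{1,f} = \sum_{i \in G_1} f_i$ and $n_{2,f} = \sum_{i \in G_2} f_i$.

For each $x_j$, the difference between the two groups is

If $n_{1,f} \ge n_{2,f}$, $d_j = \hat{F}_1(x_j) - \hat{F}_2(x_j)$

If $n_{1,f} < n_{2,f}$, $d_j = \hat{F}_2(x_j) - \hat{F}_1(x_j)$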

The maximum positive, negative and absolute differences are also computed. 
Test statistic and p-value 


The test statistic (Smirnov, 1948) is

\[ Z = \max_j |d_j| \sqrt{\frac{n_{1,f}\, n_{2,f}}{n_{1,f} + n_{2,f}}} \]

The p-value is calculated using the Smirnov approximation described in the K-S one-sample test.

$p < \alpha$ rejects the null hypothesis.


Hodges-Lehmann Estimates 


Here we assume that the two samples follow the same distribution except in the location parameters; that is, if the first sample follows $F(x)$, the second sample follows $F(x + \theta)$. We want to estimate and find the confidence interval for $\theta$.

Let $d_{ij} = x_i - x_j$, $i \in G_1$, $j \in G_2$. Incorporating the frequency weight $f_i$ for $x_i$ and $f_j$ for $x_j$, the frequency weight for $d_{ij}$ is $f^*_{ij} = f_i f_j$. Let $B_{(1)} < \ldots < B_{(L)}$ be the ordered values of $d_{ij}$ ($L$ being the number of such values), and let the corresponding frequency weights be $f^*_{(1)}, \ldots, f^*_{(L)}$.

The Hodges-Lehmann estimator for $\theta$ is $\hat{\theta} = \mathrm{median}\{B; f^*\}$.

The Moses confidence interval for $\theta$ is $[B_{(k_1)}, B_{(k_2)}]$,

where the median, $k_1$ and $k_2$ are calculated by the same formulas as those in the Hodges-Lehmann estimate for paired samples (see “Hodges-Lehmann Estimates”) but with $\mu_T$ and $\sigma_T$ replaced by the expected value and standard deviation of the test statistic under the null hypothesis in the Mann-Whitney test:

\[ \mu_T = \frac{n_{1,f}\, n_{2,f}}{2} \]

\[ \sigma_T^2 = \frac{n_{1,f}\, n_{2,f}}{12 N_f (N_f - 1)} \left( N_f (N_f^2 - 1) - \sum_{i=1}^{M} t_{i,f} (t_{i,f}^2 - 1) \right) = \frac{n_{1,f}\, n_{2,f} (N_f + 1)}{12} \left( 1 - \frac{\sum_{i=1}^{M} t_{i,f} (t_{i,f}^2 - 1)}{N_f (N_f^2 - 1)} \right) \]

\[ N_f = n_{1,f} + n_{2,f} \]

where M is the total number of distinct values among all combined observations, and $t_{i,f}$ is the number of occurrences of the ith distinct value, incorporating the frequency weight.


Moses Test of Extreme Reactions 
For two independent samples from a continuous field, this tests: 
Ho: Extreme values are equally likely in both populations 


$H_A$: Extreme values are more likely to occur in the population from which the sample with the larger range was drawn.


Span computation 


Observations from both specified groups are jointly sorted and ranked, with the average rank 
being assigned in the case of ties. The smallest and largest ranks of the control group (the group 
defined by the first value in ascending order) are determined, and the span is computed as 


SPAN = the largest rank of the control group - the smallest rank of the control group + 1
If SPAN is not an integer, then it will be rounded to its nearest integer. 


Significance Level 


Let $n_{c,f}$ and $n_{e,f}$ be the numbers of records in the control group and experiment group respectively, incorporating the frequency weight, and $g = SPAN - n_{c,f} + 2h$. Then the exact one-tailed probability of the span is

\[ p_1 = \Pr(SPAN \le n_{c,f} - 2h + g) = \frac{\sum_{i=0}^{g} \dbinom{i + n_{c,f} - 2h - 2}{i} \dbinom{n_{e,f} + 2h + 1 - i}{n_{e,f} - i}}{\dbinom{n_{c,f} + n_{e,f}}{n_{c,f}}} \]

where h = 0. The same formula is used below where h is not zero.


Censoring the Range 


The previous test is repeated, dropping the h lowest and h highest ranks from the control group, where h is a positive user-specified integer (default at the integer part of $0.05\,n_{c,f}$ or 1, whichever is greater). If $2h > n_{c,f} - 2$, then the test will be implemented using the largest integer h such that $2h \le (n_{c,f} - 2)$.


The exact one-tailed probability is calculated by the formula above, and p; < a rejects the null 
hypothesis. 
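
The exact one-tailed probability of the span can be sketched in Python as follows. The function name is hypothetical, integer counts are assumed, and the summation is capped at the experiment-group size so that no binomial coefficient has a negative lower index.

```python
from math import comb

def moses_one_tailed_p(span, n_c, n_e, h=0):
    """Pr(SPAN <= span) with h ranks trimmed from each end of the control group."""
    g = span - n_c + 2 * h
    upper = min(g, n_e)  # terms beyond n_e do not exist
    numerator = sum(
        comb(i + n_c - 2 * h - 2, i) * comb(n_e + 2 * h + 1 - i, n_e - i)
        for i in range(upper + 1)
    )
    return numerator / comb(n_c + n_e, n_c)
```

With three control and two experiment records and no trimming, the control ranks are consecutive in 3 of the 10 possible arrangements, so `moses_one_tailed_p(3, 3, 2)` is 0.3.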


Kruskal-Wallis Test 
For k independent samples from a continuous field, this tests: 
Ho: The distributions of the k samples are the same 


$H_A$: At least one sample is different


Sum of Ranks 


Observations from all k nonempty groups are jointly sorted and ranked, with the average rank being assigned in the case of ties. The number of records tied at the jth distinct value, $t_{j,f}$, is calculated incorporating the frequency weight, and the sum of $T_j = t_{j,f}^3 - t_{j,f}$ is also accumulated. For each group the sum of ranks, $R_i$, as well as the number of observations, $n_{i,f}$, is obtained.


Test statistic and p-value 


The test statistic unadjusted for ties is

\[ H = \frac{12}{N_f (N_f + 1)} \sum_{i=1}^{k} \frac{R_i^2}{n_{i,f}} - 3 (N_f + 1) \]

where $N_f = \sum_{i=1}^{k} n_{i,f}$. The statistic adjusted for ties is

\[ H' = \frac{H}{1 - \sum_{i=1}^{m} T_i / (N_f^3 - N_f)} \]

where m is the total number of tied sets.

The one-sided p-value is $p_1 = \Pr(\chi^2_{k-1} \ge H') = 1 - \Pr(\chi^2_{k-1} < H')$, where $\chi^2_{k-1}$ follows a chi-square distribution with $k - 1$ degrees of freedom.

$p_1 < \alpha$ will reject the null hypothesis.
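
A short Python sketch of the Kruskal-Wallis statistic, with and without the tie adjustment, is given below. It assumes unit frequency weights and a hypothetical function name; the chi-square p-value is left out to keep the example free of third-party libraries.

```python
from collections import Counter

def kruskal_wallis(groups):
    """Return (H, H_adjusted) for a list of samples (lists of numbers)."""
    counts = Counter(x for g in groups for x in g)
    n = sum(counts.values())
    # Average rank of each distinct value in the pooled data.
    rank, start = {}, 0
    for value in sorted(counts):
        t = counts[value]
        rank[value] = start + (t + 1) / 2.0
        start += t
    rank_sums = [sum(rank[x] for x in g) for g in groups]
    h = (12.0 / (n * (n + 1))
         * sum(r * r / len(g) for r, g in zip(rank_sums, groups))
         - 3.0 * (n + 1))
    tie_sum = sum(t ** 3 - t for t in counts.values())
    h_adjusted = h / (1.0 - tie_sum / (n ** 3 - n))
    return h, h_adjusted
```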
Median Test

For k independent samples from a continuous field, this tests:

$H_0$: $\theta_1 = \theta_2 = \cdots = \theta_k$; that is, the k samples are from populations with the same median

$H_A$: At least one population median is different

Table Construction

$\theta$ is user-specified (default at the sample median of the combined k samples).

The number of records in each of the groups that exceed the median are counted and the following table is formed, where $O_{1i,f}$ denotes the number of records that are less than or equal to the median, and $O_{2i,f}$ is the number of records that are greater than the median, in the ith group, incorporating the frequency weight. $n_{i,f} = O_{1i,f} + O_{2i,f}$, $R_{j,f} = \sum_{i=1}^{k} O_{ji,f}$, and $N_f = R_{1,f} + R_{2,f}$.

               Group 1       Group 2       ...    Group k
LE median      $O_{11,f}$    $O_{12,f}$    ...    $O_{1k,f}$
GT median      $O_{21,f}$    $O_{22,f}$    ...    $O_{2k,f}$


Test statistic and p-value

The $\chi^2$ statistic for all nonempty groups is calculated as

\[ \chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{k} \frac{(O_{ij,f} - E_{ij})^2}{E_{ij}} \]

where $E_{ij} = \dfrac{R_{i,f}\, n_{j,f}}{N_f}$.

If k=2 and $N_f > 30$, Yates’ Continuity Correction for the chi-square statistic is applied:

\[ \chi^2_c = \frac{N_f \left( \left| O_{11,f} (n_{2,f} - O_{12,f}) - O_{12,f} (n_{1,f} - O_{11,f}) \right| - N_f/2 \right)^2}{(O_{11,f} + O_{12,f}) (n_{1,f} + n_{2,f} - O_{11,f} - O_{12,f})\, n_{1,f}\, n_{2,f}} \]

The one sided p-value is $p_1 = \Pr(\chi^2_{k-1} \ge \chi^2) = 1 - \Pr(\chi^2_{k-1} < \chi^2)$, where $\chi^2_{k-1}$ follows a chi-square distribution with $k - 1$ degrees of freedom, where k is the number of nonempty groups.

$p_1 < \alpha$ rejects the null hypothesis. The results may be questionable if any cell has an expected value less than one, or more than 20% of the cells have expected values less than five.

If k=2 and $N_f \le 30$, the two sided p-value is computed using Fisher’s exact test. For more information, see the topic “Significance Levels for Fisher’s Exact Test”.
Jonckheere-Terpstra Test

For k independent samples from a continuous field, this tests:

$H_0$: $F_1(x) = F_2(x) = \ldots = F_K(x)$; that is, the k samples are from populations with the same distribution function

$H_A$: $F_1(x) \ge F_2(x) \ge \ldots \ge F_K(x)$ or $H'_A$: $F_1(x) \le F_2(x) \le \ldots \le F_K(x)$ with at least one strict inequality.

Under the assumption that all distribution functions are the same except the location parameters; that is, $F_k(x) = F(x - \tau_k)$, $k = 1, \ldots, K$, the null and alternative hypotheses become:

$H_0$: $\tau_1 = \tau_2 = \ldots = \tau_K$

$H_A$: $\tau_1 \le \tau_2 \le \ldots \le \tau_K$ or $H'_A$: $\tau_1 \ge \tau_2 \ge \ldots \ge \tau_K$ with at least one strict inequality.
For the $k_1$th sample and the $k_2$th sample, the Mann-Whitney U count is

\[ U_{k_1 k_2} = \sum_{i \in G_{k_1}} \sum_{j \in G_{k_2}} f_i f_j\, I(x_i < x_j) + \frac{1}{2} \sum_{i \in G_{k_1}} \sum_{j \in G_{k_2}} f_i f_j\, I(x_i = x_j) \]

\[ = n_{k_1,f}\, n_{k_2,f} + \frac{n_{k_1,f} (n_{k_1,f} + 1)}{2} - S_{k_1}(k_1, k_2) = S_{k_2}(k_1, k_2) - \frac{n_{k_2,f} (n_{k_2,f} + 1)}{2} \]

where $n_{k,f} = \sum_{i \in G_k} f_i$ and $S_{k_1}(k_1, k_2)$ is the sum of ranks of sample $k_1$ when sample $k_1$ and sample $k_2$ are jointly ranked incorporating frequency weight; that is,

\[ S_{k_1}(k_1, k_2) = \sum_{i \in G_{k_1}} f_i\, \mathrm{rank}(x_i; D_{k_1} \cup D_{k_2}, f) \]


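The test statistic is

\[ T = \sum_{k_1=1}^{K-1} \sum_{k_2=k_1+1}^{K} U_{k_1 k_2} \]

The standardized test statistic is

\[ T^* = \frac{T - \mu_T}{\sigma_T} \]

where

\[ \mu_T = \frac{1}{4} \left( N_f^2 - \sum_{k=1}^{K} n_{k,f}^2 \right), \qquad N_f = \sum_{k=1}^{K} n_{k,f} \]

\[ \sigma_T^2 = \frac{1}{72} \left[ N_f (N_f - 1)(2N_f + 5) - \sum_{k=1}^{K} n_{k,f} (n_{k,f} - 1)(2 n_{k,f} + 5) - \sum_{i=1}^{M} t_{i,f} (t_{i,f} - 1)(2 t_{i,f} + 5) \right] \]

\[ \quad + \frac{\left[ \sum_{k=1}^{K} n_{k,f} (n_{k,f} - 1)(n_{k,f} - 2) \right] \left[ \sum_{i=1}^{M} t_{i,f} (t_{i,f} - 1)(t_{i,f} - 2) \right]}{36 N_f (N_f - 1)(N_f - 2)} + \frac{\left[ \sum_{k=1}^{K} n_{k,f} (n_{k,f} - 1) \right] \left[ \sum_{i=1}^{M} t_{i,f} (t_{i,f} - 1) \right]}{8 N_f (N_f - 1)} \]

and M is the total number of distinct values among all combined observations, and $t_{i,f}$ is the number of occurrences of the ith distinct value considering the frequency weight.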


The one sided and two sided p-values are

\[ p_1 = \Pr(Z \ge |T^*|) = 1 - \Phi(|T^*|) \]

\[ p = 2p_1 \]

$p_1 < \alpha$ will reject the null hypothesis in favor of $H'_A$: $F_1(x) \le F_2(x) \le \ldots \le F_K(x)$ if $T^* < 0$ and in favor of $H_A$: $F_1(x) \ge F_2(x) \ge \ldots \ge F_K(x)$ if $T^* > 0$.

$p < \alpha$ will reject the null hypothesis in favor of an ordered alternative (either direction of ordering).


Note: When there are only two samples, K = 2, the Jonckheere-Terpstra test reduces to the 
Mann-Whitney test. 


One-sided test 


If the direction of the alternative is specified, this becomes a one-sided test. The previously defined one-sided p-value is not the p-value for a fixed one-sided test, and cannot be used alone to make a decision for a one-sided test.

If the alternative is $H'_A$: $F_1(x) \le F_2(x) \le \ldots \le F_K(x)$, the p-value for the one-sided test is

\[ p = \Pr(Z \le T^*) = \Phi(T^*) = \begin{cases} p_1, & T^* < 0 \\ 1 - p_1, & T^* \ge 0 \end{cases} \]

If the alternative is $H_A$: $F_1(x) \ge F_2(x) \ge \ldots \ge F_K(x)$, the p-value for the one-sided test is

\[ p = \Pr(Z \ge T^*) = 1 - \Phi(T^*) = \begin{cases} p_1, & T^* \ge 0 \\ 1 - p_1, & T^* < 0 \end{cases} \]


Note: The one-sided test will be used in multiple comparisons for Jonckheere-Terpstra test. 
Related Samples Tests


The following related samples tests are available. 


McNemar’s Test 


For two related samples from a categorical field with 2 values (or a recoded categorical field 
with more than 2 values), this tests: 


Ho: The two samples have the same marginal distribution. 


Let $n_{1,f}$ be the number of records in which $x_i$ is a success and $y_i$ is a failure, and $n_{2,f}$ be the number of records in which $x_i$ is a failure and $y_i$ is a success, incorporating the frequency weights.

If $n_{1,f} + n_{2,f} \le 25$, the two-sided exact probability is

\[ p = 2 \sum_{i=0}^{r} \binom{n_{1,f} + n_{2,f}}{i}\, 0.5^{\,n_{1,f} + n_{2,f}} \]

where $r = \min(n_{1,f}, n_{2,f})$.

If $n_{1,f} + n_{2,f} > 25$, the test statistic is

\[ \chi^2_c = \frac{\left( |n_{1,f} - n_{2,f}| - 1 \right)^2}{n_{1,f} + n_{2,f}} \]

The one sided p-value is

\[ p = \Pr(\chi^2_1 \ge \chi^2_c) = 1 - \Pr(\chi^2_1 < \chi^2_c) \]

where $\chi^2_1$ has a chi-square distribution with 1 degree of freedom.

$p < \alpha$ will reject the null hypothesis.
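
Both branches of McNemar's test can be sketched in a few lines of Python. The function name is hypothetical, integer counts are assumed, and capping the exact probability at 1 is an assumption of this sketch (the doubled binomial sum can exceed 1 when the two counts are equal). The chi-square tail with 1 degree of freedom is computed through the identity $\Pr(\chi^2_1 \ge x) = \mathrm{erfc}(\sqrt{x/2})$.

```python
from math import comb, erfc, sqrt

def mcnemar_p(n1, n2):
    """p-value from the discordant counts n1 and n2."""
    n = n1 + n2
    if n <= 25:
        r = min(n1, n2)
        p = 2.0 * sum(comb(n, i) for i in range(r + 1)) * 0.5 ** n
        return min(p, 1.0)  # cap is an assumption of this sketch
    chi2_c = (abs(n1 - n2) - 1.0) ** 2 / n
    return erfc(sqrt(chi2_c / 2.0))
```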


Wilcoxon Signed-Rank Test 
For two related samples from a continuous field, this tests: 
$H_0$: $\theta$ = Median$(X_1 - X_2) = 0$

$H_A$: $\theta < 0$ or $\theta > 0$


Computing Ranked Differences

For each record, the difference $d_i = x_{i1} - x_{i2}$ is computed, as well as the absolute value. All nonzero absolute differences are sorted into ascending order, and ranks are assigned. In the case of ties, the average rank is used. Let $D = \{d_i : |d_i| \neq 0\}$; then the sums of the ranks corresponding to positive and negative differences are

\[ S_p = \sum_{i \in D} f_i\, \mathrm{rank}(|d_i|; D, f)\, I(\mathrm{sign}(d_i) > 0) \]

and

\[ S_n = \sum_{i \in D} f_i\, \mathrm{rank}(|d_i|; D, f)\, I(\mathrm{sign}(d_i) < 0) \]

respectively. Then the average positive rank and average negative rank are

\[ \bar{X}_p = S_p / n_{p,f} \]

and

\[ \bar{X}_n = S_n / n_{n,f} \]

where $n_{p,f}$ is the number of records with positive differences and $n_{n,f}$ the number with negative differences.


Test statistic and p-value 


The test statistic is

\[ T = \frac{S_p - \mu_T}{\sigma_T} \]

where

\[ \mu_T = \frac{n_f (n_f + 1)}{4} \]

\[ \sigma_T^2 = \frac{n_f (n_f + 1)(2 n_f + 1)}{24} - \sum_{j=1}^{l} \frac{t_{j,f}^3 - t_{j,f}}{48} \]

\[ n_f = \sum_{i \in D} f_i \]

where l is the total number of distinct rank values and $t_{j,f}$ is the number of records tied at the jth distinct value, incorporating the frequency weight.

The one-sided and two-sided p-values are

\[ p_1 = \Pr(Z \ge |T|) = 1 - \Phi(|T|) \]

\[ p = 2p_1 \]

$p_1 < \alpha$ will reject the null hypothesis in favor of $\theta > 0$ if $T > 0$ and $\theta < 0$ if $T < 0$.

$p < \alpha$ will reject the null hypothesis in favor of $\theta > 0$ or $\theta < 0$.


Sign Test 
For two related samples from a continuous field, this tests: 
$H_0$: $\theta$ = Median$(X_1 - X_2) = 0$
$H_A$: $\theta < 0$ or $\theta > 0$
Counting Signs 


For each record, the difference $d_i = x_{i1} - x_{i2}$ is computed and the number of positive ($n_{p,f}$) and negative ($n_{n,f}$) differences, incorporating the frequency weight, are counted:

\[ n_{p,f} = \sum_{i=1}^{n} f_i\, I(d_i > 0) \]

\[ n_{n,f} = \sum_{i=1}^{n} f_i\, I(d_i < 0) \]

Cases with $x_{i1} = x_{i2}$ are ignored.


Test statistic and p-value 
If $n_{p,f} + n_{n,f} = 0$, then the one-sided exact probability is $p_1 = 0.5$.
If $0 < n_{p,f} + n_{n,f} \le 25$, then $p_1$ is calculated recursively from the binomial distribution:

\[ p_1 = \min\{p^*, p^{**}\} \]

where

\[ p^* = \sum_{i=0}^{n_{p,f}} \binom{n_{p,f} + n_{n,f}}{i}\, 0.5^{\,n_{p,f} + n_{n,f}} \]

\[ p^{**} = \sum_{i=0}^{n_{n,f}} \binom{n_{p,f} + n_{n,f}}{i}\, 0.5^{\,n_{p,f} + n_{n,f}} \]

If $n_{p,f} + n_{n,f} > 25$, the test statistic is

\[ z_c = \frac{\max(n_{p,f}, n_{n,f}) - 0.5\,(n_{p,f} + n_{n,f}) - 0.5}{0.5 \sqrt{n_{p,f} + n_{n,f}}} \]

The one sided and two sided p-values are

\[ p_1 = \Pr(Z \ge |z_c|) = 1 - \Phi(|z_c|) \]

\[ p = 2p_1 \]

$p_1 < \alpha$ rejects the null hypothesis in favor of $\theta > 0$ if $n_{p,f} > n_{n,f}$ and $\theta < 0$ if $n_{p,f} < n_{n,f}$.

$p < \alpha$ will reject the null hypothesis in favor of $\theta > 0$ or $\theta < 0$.
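
The one-sided sign test probability can be sketched in Python as follows. The function name is hypothetical and integer counts are assumed; the binomial sums are computed directly rather than recursively.

```python
from math import comb, erf, sqrt

def sign_test_one_sided_p(n_pos, n_neg):
    """One-sided p-value from the numbers of positive and negative differences."""
    n = n_pos + n_neg
    if n == 0:
        return 0.5
    if n <= 25:
        def lower_tail(k):
            return sum(comb(n, i) for i in range(k + 1)) * 0.5 ** n
        return min(lower_tail(n_pos), lower_tail(n_neg))
    z_c = (max(n_pos, n_neg) - 0.5 * n - 0.5) / (0.5 * sqrt(n))
    return 1.0 - 0.5 * (1.0 + erf(abs(z_c) / sqrt(2.0)))
```

The two-sided value is twice the returned probability.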


Marginal Homogeneity 
For two related samples from an ordinal field, this tests: 
Ho: The two samples have the same marginal distribution. 


Let $n_{uv,f}$ be the cell count incorporating the frequency weight for cell $(x_1 = u, x_2 = v)$:

\[ n_{uv,f} = \sum_{i=1}^{n} I(x_{i1} = u, x_{i2} = v)\, f_i \]

The test statistic is

\[ T = \sum_{u} \sum_{v \neq u} w_u\, n_{uv,f} \]

The standardized test statistic is

\[ T^* = \frac{T - \mu_T}{\sigma_T} \]

where

\[ \mu_T = \frac{1}{2} \sum_{u} \sum_{v \neq u} (w_u + w_v)\, n_{uv,f} \]

\[ \sigma_T^2 = \sum_{u} \sum_{v \neq u} \left( \frac{w_u - w_v}{2} \right)^2 n_{uv,f} \]

The asymptotic one sided p-value is $p_1 = \Pr(Z \ge |T^*|) = 1 - \Phi(|T^*|)$.

$p_1 < \alpha$ rejects the null hypothesis in favor of $F_1(x) \ge F_2(x)$ if $T^* < 0$ or $F_1(x) \le F_2(x)$ if $T^* > 0$, with strict inequality for at least one x.

The asymptotic two sided p-value is $p = 2p_1$.


Note: Any linear transformation of scores produces the same standardized test statistic and p-value. 


Hodges-Lehmann Estimates 


For two related samples from a continuous field, this finds a confidence interval for the median difference: letting $d_i = x_{i1} - x_{i2}$, we assume that $d_i$ follows a symmetric distribution with median $\theta$.

Let $A_{ij} = \dfrac{d_i + d_j}{2}$, $i \le j$. Incorporating the frequency weight $f_i$ for $d_i$ and $f_j$ for $d_j$, the frequency weight for $A_{ij}$ is

\[ f^*_{ij} = \begin{cases} f_i (f_i + 1)/2 & i = j \\ f_i f_j & i < j \end{cases} \]

Let $B_{(1)} < \ldots < B_{(L)}$ be the ordered values of $A_{ij}$, $i \le j$ ($L$ being the number of such values), and let the corresponding frequency weights be $f^*_{(1)}, \ldots, f^*_{(L)}$.

The Hodges-Lehmann estimator for $\theta$ is the median of $B_{(1)}, \ldots, B_{(L)}$ incorporating the frequency weights:

\[
\hat{\theta} = \mathrm{median}\{B; f^*\} = \begin{cases}
\dfrac{B_{(k)} + B_{(k+1)}}{2} & \text{if } W_{(L)} \text{ is even and } W_{(k)} \le \dfrac{W_{(L)}}{2} < W_{(k)} + 1 \\
B_{(k)} & \text{otherwise, with } \dfrac{W_{(L)} + 1}{2} \in [W_{(k-1)} + 1, W_{(k)}]
\end{cases}
\]

where $W_{(k)} = \sum_{i=1}^{k} f^*_{(i)}$.

The Tukey confidence interval for $\theta$ is $[B_{(k_1)}, B_{(k_2)}]$,

where $k_1$ and $k_2$ are integers such that

\[ i_1 \in [W_{(k_1 - 1)} + 1, W_{(k_1)}] \]

\[ i_2 \in [W_{(k_2 - 1)} + 1, W_{(k_2)}] \]

with

\[ i_1 = 1 + \lfloor \mu_T - z_{\alpha/2}\, \sigma_T \rfloor \]

\[ i_2 = \lceil \mu_T + z_{\alpha/2}\, \sigma_T \rceil \]

and $\lfloor x \rfloor$ and $\lceil x \rceil$ are the floor and ceiling integers of x, $\mu_T$ and $\sigma_T$ are the expected value and standard deviation of the test statistic T under the null hypothesis in the Wilcoxon signed rank test, and $z_{\alpha/2}$ is the right tail percentile such that $\Pr(Z \ge z_{\alpha/2}) = \alpha/2$, where Z is a random variate following a standard normal distribution.


Cochran’s Q Test 


For k related samples from a categorical field with 2 values (or recoded categorical field with 
more than 2 values), this tests: 


Ho: The distributions of these k samples are the same. 


For each record, the number of successes across samples is counted. The number of successes for record i is

\[ R_i = \sum_{j=1}^{k} I(x_{ij} \text{ is success}) \]

and the total number of successes for sample l, incorporating the frequency weights, is

\[ C_{l,f} = \sum_{i=1}^{n} f_i\, I(x_{il} \text{ is success}) \]

The test statistic is

\[ Q = \frac{(k - 1) \left[ k \sum_{l=1}^{k} C_{l,f}^2 - \left( \sum_{l=1}^{k} C_{l,f} \right)^2 \right]}{k \sum_{l=1}^{k} C_{l,f} - \sum_{i=1}^{n} f_i R_i^2} \]

The one-sided p-value is

\[ p = \Pr(\chi^2_{k-1} \ge Q) = 1 - \Pr(\chi^2_{k-1} < Q) \]

where $\chi^2_{k-1}$ follows a chi-square distribution with $k - 1$ degrees of freedom.

$p < \alpha$ rejects the null hypothesis.
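
Cochran's Q can be computed directly from a 0/1 data matrix, as in this Python sketch. The function name and the optional `weights` argument (one frequency weight per record) are hypothetical.

```python
def cochran_q(rows, weights=None):
    """rows[i][l] is 1 for success and 0 for failure; returns the Q statistic."""
    k = len(rows[0])
    if weights is None:
        weights = [1] * len(rows)
    # Weighted number of successes per sample (column).
    col = [sum(f * r[l] for f, r in zip(weights, rows)) for l in range(k)]
    # Weighted sum of squared successes per record (row).
    row_sq = sum(f * sum(r) ** 2 for f, r in zip(weights, rows))
    total = sum(col)
    return (k - 1) * (k * sum(c * c for c in col) - total ** 2) / (k * total - row_sq)
```

Supplying a weight of 2 for a record gives the same result as listing that record twice.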
Friedman’s Test

For k related samples from a continuous field, this tests:

$H_0$: The distributions of these k samples are the same.

For each record, the k samples are sorted and ranked, with average rank being assigned in the case of ties. For each sample, the sum of ranks over the records is calculated, incorporating the frequency weight, as follows:

\[ C_{l,f} = \sum_{i=1}^{n} f_i\, \mathrm{rank}(x_{il}; D_i, f) \]

where $D_i = \{x_{ij}, j = 1, \ldots, k\}$. The average rank for each sample is

\[ \bar{R}_l = C_{l,f} / n_f \]

where $n_f = \sum_{i=1}^{n} f_i$.

The test statistic is

\[ \chi^2 = \frac{\dfrac{12}{n_f\, k (k + 1)} \sum_{l=1}^{k} C_{l,f}^2 - 3 n_f (k + 1)}{1 - \sum T / \left( n_f\, k (k^2 - 1) \right)} \]

where

\[ \sum T = \sum_{i=1}^{n} \sum_{j=1}^{m_i} f_i \left( t_{ij}^3 - t_{ij} \right) \]

and $m_i$ is the total number of distinct rank values of the ith record and $t_{ij}$ is the number of fields tied at the jth distinct value of the ith record, incorporating the frequency weight. The one-sided p-value is

\[ p = \Pr(\chi^2_{k-1} \ge \chi^2) = 1 - \Pr(\chi^2_{k-1} < \chi^2) \]

where $\chi^2_{k-1}$ follows a chi-square distribution with $k - 1$ degrees of freedom.

$p < \alpha$ rejects the null hypothesis.


Kendall’s Coefficient of Concordance 
For k related samples from a continuous field, this tests: 


Ho: The distributions of these k samples are the same. 


The coefficient of concordance (W) is

\[ W = \frac{F}{n_f (k - 1)} \]

where F is the Friedman $\chi^2$ statistic and $n_f = \sum_{i=1}^{n} f_i$.

The test statistic is

\[ \chi^2 = n_f (k - 1)\, W \]

The one-sided p-value is calculated as

\[ p = \Pr(\chi^2_{k-1} \ge \chi^2) = 1 - \Pr(\chi^2_{k-1} < \chi^2) \]

where $\chi^2_{k-1}$ follows a chi-square distribution with $k - 1$ degrees of freedom.

$p < \alpha$ rejects the null hypothesis.


Multiple Comparisons 


Tests such as Kruskal-Wallis involve more than two samples. They test if all samples are from 
populations with the same characteristics. This characteristic may be the distribution, mean or 
median depending on the hypotheses. Denote the overall null hypothesis as $H_0: \theta_1 = \ldots = \theta_K$.
When this overall hypothesis is rejected at the user-specified significance level a (using 
two-sided p-values except for the Jonckheere-Terpstra test here), we may want to know where 
the differences are among the populations. Two multiple comparison procedures are considered 
to answer this question: pairwise multiple comparisons and a stepwise stepdown procedure 
for multiple comparisons. 


Pairwise Multiple Comparisons 


All possible pairwise hypotheses like $H_{0,jk}: \theta_j = \theta_k$ for $1 \le j < k \le K$ are tested. There are $K(K - 1)/2$ of them. In order to control the familywise type I error (that is, the probability of rejecting at least one pairwise hypothesis given that all pairwise hypotheses are true), adjusted p-values are calculated and used to make the decision for each pair. For pair (j, k), reject $H_{0,jk}$ at level $\alpha$ if $p_{adj,jk} < \alpha$. The adjusted p-values are calculated in the following way.

1. Calculate the p-value, $p_{jk}$, for each of the pairwise hypotheses.

2. Calculate the adjusted p-value as $p_{adj,jk} = p_{jk}\, K(K - 1)/2$.
Notes

■ If the adjusted p-value is bigger than 1, it is set to 1. The calculation of the p-value in step 1 depends on the specific method used to do the overall test. The details are listed below; in the following, two-sided p-values are used except for the Jonckheere-Terpstra test.

■ The Kruskal-Wallis, Friedman and Kendall, and Cochran tests use the procedure proposed by Dunn (1964) (originally designed for the Kruskal-Wallis test). The procedure uses ranks (or successes for the Cochran test) based on considering all samples rather than just the two involved in a given comparison.
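
The adjustment of the pairwise p-values, including the cap at 1, can be sketched in Python. The function name and the dictionary layout (pair of sample indices mapped to an unadjusted p-value) are assumptions made for illustration.

```python
def adjusted_p_values(pairwise_p, k):
    """Multiply each pairwise p-value by K(K-1)/2 and cap the result at 1."""
    n_pairs = k * (k - 1) / 2
    return {pair: min(1.0, p * n_pairs) for pair, p in pairwise_p.items()}

# Example with K = 3 samples, so each p-value is multiplied by 3.
example = adjusted_p_values({(1, 2): 0.01, (1, 3): 0.2, (2, 3): 0.5}, k=3)
```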


Kruskal-Wallis test 


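Let $R_j = \sum_{i \in G_j} f_i\, \mathrm{rank}(x_i; D, f)$ be the sum of ranks for sample j, incorporating frequency weights.

For testing $H_{0,jk}: \theta_j = \theta_k$, the test statistic is

\[ T_{jk} = \frac{R_j}{n_{j,f}} - \frac{R_k}{n_{k,f}} \]

The standardized test statistic is $T^*_{jk} = T_{jk} / \sigma_{jk}$, where

\[ \sigma_{jk}^2 = \left[ \frac{N_f (N_f + 1)}{12} - \frac{\sum_{i=1}^{M} (t_{i,f}^3 - t_{i,f})}{12 (N_f - 1)} \right] \left( \frac{1}{n_{j,f}} + \frac{1}{n_{k,f}} \right) \]

and M is the total number of distinct values among all observations, and $t_{i,f}$ is the number of occurrences of the ith distinct value incorporating the frequency weight.

The two-sided p-value is

\[ p_{jk} = \Pr(|Z| \ge |T^*_{jk}|) = 2 \left( 1 - \Phi(|T^*_{jk}|) \right) \]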


K-sample median test 


For $H_{0,jk}: \theta_j = \theta_k$, perform the median test using data only consisting of sample j and sample k
as if other samples don’t exist. In this test the median of the two samples is found, and the number 
above and below that median is used in the test. 


Jonckheere-Terpstra test for ordered alternatives 


For pair (j, k), j<k, the null and alternative hypotheses are $H_{0,jk}: \theta_j = \theta_k$ vs $H_{A,jk}: \theta_j > \theta_k$ if the overall alternative hypothesis $H_A: \theta_1 \ge \ldots \ge \theta_K$ is specified or favored (it is favored if $T^* > 0$, where $T^*$ is the standardized test statistic in the overall Jonckheere-Terpstra test), or vs $H_{A,jk}: \theta_j < \theta_k$ if $H_A: \theta_1 \le \ldots \le \theta_K$ is specified or favored (it is favored if $T^* < 0$). Use Mann-Whitney’s U test on each pair of hypotheses to calculate the p-value for the one-sided test.


Friedman’s test and Kendall’s Coefficient of Concordance 


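For treatment j, let $R_j = \sum_{i=1}^{n} f_i\, \mathrm{rank}(x_{ij}; \text{row } i \text{ of } D)$ be the sum of ranks for sample j.

For testing $H_{0,jk}: \theta_j = \theta_k$, the test statistic is

\[ T_{jk} = \frac{R_j - R_k}{n_f} \]

The standardized test statistic is $T^*_{jk} = T_{jk} / \sigma$, where

\[ \sigma^2 = \frac{K (K + 1)}{6 n_f} \]

The two-sided p-value is $p_{jk} = \Pr(|Z| \ge |T^*_{jk}|) = 2 \left( 1 - \Phi(|T^*_{jk}|) \right)$.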


Cochran’s Q Test 


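Using $x_{ij} = 1$ to represent success and $x_{ij} = 0$ to represent failure, let $C_j = \sum_{i=1}^{n} f_i x_{ij}$ be the total number of successes for sample j incorporating frequency weights, and $R_i = \sum_{j=1}^{K} x_{ij}$ be the total number of successes for record i.

The test statistic for $H_{0,jk}: \theta_j = \theta_k$ is

\[ T_{jk} = \frac{C_j - C_k}{n_f} \]

The standardized test statistic is $T^*_{jk} = T_{jk} / \sigma$, where

\[ \sigma^2 = \frac{2 \left( K \sum_{j=1}^{K} C_j - \sum_{i=1}^{n} f_i R_i^2 \right)}{n_f^2\, K (K - 1)} \]

The two-sided p-value is $p_{jk} = \Pr(|Z| \ge |T^*_{jk}|) = 2 \left( 1 - \Phi(|T^*_{jk}|) \right)$.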


Stepwise Stepdown Multiple Comparisons 


The procedure described in this section is an extension of the ad hoc procedure developed by 
Campbell and Skillings (1985). This procedure starts with the overall hypothesis involving all K 
populations, and if the hypothesis is rejected, then it considers the sub-hypotheses involving K-1
populations, continuing until the hypothesis only involves two populations or no hypotheses are
rejected. If all sub-hypotheses are considered, it may be computationally too expensive when K is 
big, so a shortcut is used on the sorted samples. This procedure returns a sequence of subsets of 
populations with homogeneous characteristics. 


Sort the samples 


The K samples are sorted from the smallest to largest by test-specific criteria. Let (1), ..., (K) 
index the sorted samples. 


■ Kruskal-Wallis: average treatment rank, where rank is the joint rank of all the observations.
  Use the treatment median to break ties.

■ Median: treatment median.

■ Jonckheere-Terpstra test: given by the user-specified alternative hypothesis order.

■ Friedman: average treatment rank (same as using treatment rank sum), where rank is the joint
  within row/block ranking.

■ Kendall's coefficient of concordance: same as Friedman.

■ Cochran's Q: average treatment mean, which is the same as using the treatment sum.


Find the homogeneous subsets 
Starting with sample (1), sequentially test $H_0: \theta_{(1)} = \theta_{(2)}$, then $H_0: \theta_{(1)} = \theta_{(2)} = \theta_{(3)}$, and so
on, until the null hypothesis is rejected when sample (j) is added. Samples (1) through (j-1)
are considered homogeneous. The process repeats starting with sample (j) and continues until
sample (K).
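
The scan over the sorted samples can be sketched in Python as follows. This is a minimal illustration, not the procedure's actual implementation: the function name `homogeneous_subsets`, the `p_value` callback (standing in for whichever overall test is in use), and the fixed rejection level `alpha` are all assumptions introduced here.

```python
from typing import Callable, List, Sequence


def homogeneous_subsets(
    sorted_samples: Sequence[Sequence[float]],
    p_value: Callable[[List[Sequence[float]]], float],
    alpha: float = 0.05,  # assumed rejection level; not specified in the text
) -> List[List[int]]:
    """Group samples (already sorted by the test-specific criterion)
    into consecutive homogeneous subsets of sample indices."""
    subsets: List[List[int]] = []
    k = len(sorted_samples)
    start = 0
    while start < k:
        end = start + 1  # current subset is sorted_samples[start:end]
        # Add one sample at a time until the joint null hypothesis is rejected.
        while end < k and p_value(list(sorted_samples[start:end + 1])) >= alpha:
            end += 1
        subsets.append(list(range(start, end)))
        start = end  # the sample that caused rejection starts the next subset
    return subsets
```

Any overall K-sample test can be plugged in through `p_value`; the indices returned refer to positions in the sorted order (1), ..., (K), shifted to start at 0.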
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If a WEIGHT variable is specified, it is used to replicate a case as many times as indicated by 
the weight value rounded to the nearest integer. If the workspace requirements are exceeded and 
sampling has been selected, a random sample of cases is chosen for analysis using the algorithm 
described in SAMPLE. For the RUNS test, if sampling is specified, it is ignored. The tests are 
described in (Siegel, 1956). 


One-Sample Chi-Square Test 


Cell Specification 


If the (lo, hi) specification is used, each integer value in the lo to hi range is designated a cell. 
Otherwise, each distinct value encountered is considered a cell. 


Observed Frequencies 


If (lo, hi) has been selected, every observed value is truncated to an integer and, if it is in the lo to 
hi range, it is included in the frequency count for the corresponding cell. Otherwise, a count of 
the frequency of occurrence of the distinct values is obtained. 


Expected Frequencies 


If none or EQUAL is specified,

$$EXP_i = \frac{\text{number of observations } (N) \text{ [in range]}}{\text{number of cells } (k)}$$

When the expected values ($E_i$) are specified either as counts, percentages, or proportions,

$$EXP_i = \frac{N E_i}{\sum_{j=1}^{k} E_j}$$


If there are cells with expected values less than 5, the number of such cells and the minimum 
expected value are printed. 


If the number of user-supplied expected frequencies is not equal to the number of cells generated, 
or if an expected value is less than or equal to zero, the test terminates with an error message. 
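
The expected-frequency rules above can be sketched as follows. The function name and return shape are illustrative only, and rescaling user-supplied values so that they sum to N is an assumption made here so that counts, percentages, and proportions are handled uniformly.

```python
from typing import List, Optional, Sequence, Tuple


def expected_frequencies(
    observed: Sequence[float],
    expected: Optional[Sequence[float]] = None,
) -> Tuple[List[float], int, Optional[float]]:
    """Return (expected counts, number of cells with expected < 5,
    minimum expected value among those cells or None)."""
    n = sum(observed)
    k = len(observed)
    if expected is None:
        # EQUAL: every cell gets N / k.
        exp = [n / k] * k
    else:
        # Mirrors the error conditions described in the text.
        if len(expected) != k or any(e <= 0 for e in expected):
            raise ValueError("expected values must be positive, one per cell")
        total = sum(expected)
        exp = [n * e / total for e in expected]  # assumed rescaling to sum to N
    small = [e for e in exp if e < 5]
    return exp, len(small), (min(small) if small else None)
```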


Chi-Square and Its Degrees of Freedom 


$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - EXP_i)^2}{EXP_i}$$

The significance level is from the chi-square distribution with k - 1 degrees of freedom.

Kolmogorov-Smirnov One-Sample Test 


Calculation of Empirical Cumulative Distribution Function 


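The observations are sorted into ascending order $X_{(1)}$ to $X_{(N)}$. The empirical cdf, $F(X)$, is

$$F(X) = \begin{cases} 0 & -\infty < X < X_{(1)} \\ i/N & X_{(i)} \le X < X_{(i+1)}, \quad i = 1, \ldots, N-1 \\ 1 & X_{(N)} \le X < \infty \end{cases}$$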


Estimation of Parameters for Theoretical Distribution 


It is possible to test that the underlying distribution is either uniform, normal, or Poisson. If the 
parameters are not specified, they are estimated from the data. 


Uniform 


Poisson

$$\text{mean } (\lambda) = \sum_{i=1}^{N} X_i / N$$

The test is not done if, for the uniform, any data fall outside the user-specified range or, for the
Poisson, the data are not all non-negative integers. If the variance of the normal or the mean of
the Poisson is zero, the test is also not done.


Calculation of Theoretical Cumulative Distribution Functions 


For Uniform 


NPAR TESTS Algorithms 


$$F_0(X_i) = \frac{X_i - \min}{\max - \min}$$

For Poisson

$$F_0(X_i) = \sum_{l=0}^{X_i} \frac{e^{-\lambda} \lambda^l}{l!}$$

If $\lambda > 100{,}000$, the normal approximation is used.

For Normal

$$F_0(X_i) = F_{0,1}\left(\frac{X_i - \bar{X}}{S}\right)$$

where the generation of $F_{0,1}(Z)$ is described in “Significance Level of a Standard Normal Deviate”.


Calculation of Differences 


For the Uniform and Normal, two differences are computed:

$$D_i = F(X_{i-1}) - F_0(X_i)$$
$$\tilde{D}_i = F(X_i) - F_0(X_i), \quad i = 1, \ldots, N$$

For the Poisson:

$$D_i = \begin{cases} F(X_{i-1}) - F_0(X_i - 1) & X_i > 0 \\ 0 & X_i = 0 \end{cases} \quad i = 1, \ldots, N$$

$$\tilde{D}_i = F(X_i) - F_0(X_i)$$

The maximum positive, negative, and absolute differences are printed.


Test Statistic and Significance 
The test statistic is

$$Z = \sqrt{N} \max_i \left( |D_i|, |\tilde{D}_i| \right)$$

The two-tailed probability level is estimated using the first three terms of the Smirnov (1948)
formula.

if $0 \le Z < 0.27$, $p = 1$

if $0.27 \le Z < 1$, $p = 1 - \dfrac{2.506628}{Z}\left(Q + Q^9 + Q^{25}\right)$, where $Q = e^{-1.233701 Z^{-2}}$

if $1 \le Z < 3.1$, $p = 2\left(Q - Q^4 + Q^9 - Q^{16}\right)$, where $Q = e^{-2Z^2}$

if $Z \ge 3.1$, $p = 0$
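
As a rough illustration, the piecewise approximation can be coded directly. The constants 2.506628 and 1.233701 are taken here to be $\sqrt{2\pi}$ and $\pi^2/8$, which is an assumption based on the usual form of the Smirnov series; the function name is hypothetical.

```python
import math


def smirnov_two_tailed_p(z: float) -> float:
    """Two-tailed significance of the K-S Z statistic via the
    piecewise Smirnov approximation."""
    if z < 0.27:
        return 1.0
    if z < 1.0:
        q = math.exp(-1.233701 / z ** 2)  # assumed constant pi^2 / 8
        return 1.0 - (2.506628 / z) * (q + q ** 9 + q ** 25)  # sqrt(2*pi) / z
    if z < 3.1:
        q = math.exp(-2.0 * z ** 2)
        return 2.0 * (q - q ** 4 + q ** 9 - q ** 16)
    return 0.0
```

The two middle branches agree closely at Z = 1 (both give about 0.27), which is a useful sanity check on the constants.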

Note: If the distribution is normal and parameters are estimated from the data, then the Lilliefors 
method is used to compute the test statistic and p value instead of the method described in this 


section. For more information, see the topic “Kolmogorov-Smirnov Statistic with Lilliefors’ 
Significance”. 


Runs Test 
Computation of Cutting Point 
The cutting point, which is used to dichotomize the data, can be specified as a particular number or
as the value of a statistic to be calculated. The possible statistics are

$$\text{Mean} = \sum_{i=1}^{N} X_i / N$$

$$\text{Median} = \begin{cases} \left(X_{(N/2)} + X_{(N/2+1)}\right)/2 & \text{if } N \text{ is even} \\ X_{((N+1)/2)} & \text{if } N \text{ is odd} \end{cases}$$

where the data are sorted in ascending order from $X_{(1)}$, the smallest, to $X_{(N)}$, the largest.

Mode = most frequently occurring value

If there are multiple modes, the one largest in value is selected and a warning is printed.

Number of Runs 

For each of the data points, in the sequence in the file, the difference

$$D_i = X_i - \text{CUTPOINT}$$

is computed. If $D_i \ge 0$, the difference is considered positive, otherwise negative. The number
of times the sign changes, that is, $D_i \ge 0$ and $D_{i+1} < 0$, or $D_i < 0$ and $D_{i+1} \ge 0$, as well as
the number of positive ($n_p$) and negative ($n_a$) signs, are determined. The number of runs (R) is the
number of sign changes plus one.


Significance Level 


The sampling distribution of the number of runs (R) is approximately normal with

$$\mu_r = \frac{2 n_p n_a}{n_p + n_a} + 1$$

$$\sigma_r = \sqrt{\frac{2 n_p n_a \left(2 n_p n_a - n_a - n_p\right)}{(n_p + n_a)^2 (n_p + n_a - 1)}}$$

The two-sided significance level is based on

$$Z = \frac{R - \mu_r}{\sigma_r}$$

unless n < 50; then

$$Z_c = \begin{cases} (R - \mu_r + 0.5)/\sigma_r & \text{if } R - \mu_r \le -0.5 \\ (R - \mu_r - 0.5)/\sigma_r & \text{if } R - \mu_r \ge 0.5 \\ 0 & \text{if } |R - \mu_r| < 0.5 \end{cases}$$
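
A compact sketch of the runs test follows. It assumes that n in the continuity-correction rule means $n_p + n_a$ and that values equal to the cut point count as positive; the function name and the use of `math.erfc` for the normal tail are choices made for this illustration.

```python
import math
from typing import Sequence, Tuple


def runs_test(x: Sequence[float], cutpoint: float) -> Tuple[int, float, float]:
    """Return (number of runs R, Z statistic, two-sided significance)."""
    signs = [v - cutpoint >= 0 for v in x]  # True = positive difference
    n_p = sum(signs)
    n_a = len(signs) - n_p
    if n_p == 0 or n_a == 0:
        raise ValueError("all observations fall on one side of the cut point")
    runs = 1 + sum(a != b for a, b in zip(signs, signs[1:]))
    n = n_p + n_a
    mu = 2 * n_p * n_a / n + 1
    sigma = math.sqrt(2 * n_p * n_a * (2 * n_p * n_a - n) / (n ** 2 * (n - 1)))
    diff = runs - mu
    if n < 50:  # continuity correction for small samples
        if diff <= -0.5:
            z = (diff + 0.5) / sigma
        elif diff >= 0.5:
            z = (diff - 0.5) / sigma
        else:
            z = 0.0
    else:
        z = diff / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # 2 * (1 - Phi(|z|))
    return runs, z, p
```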


Binomial Test 


Table 70-1
Notation

Notation    Description
$n_1$       Number of observations in the first (test) category
$n_2$       Number of observations in the second category
$p$         Test probability
$m$         $\min(n_1, n_2)$
$N$         $n_1 + n_2$
$p^*$       $p$ if $m = n_1$; $1 - p$ if $m = n_2$


When the test probability is equal to 0.5, a two-tailed test is performed. The two-tailed probability
is

$$\min\left\{1,\; 2 \sum_{i=0}^{m} \binom{N}{i} (0.5)^N \right\}$$


When the test probability is not equal to 0.5, a one-tailed test is performed. The one-tailed 
probability is 


McNemar’s Test 


Table Construction

The data values are searched to determine the two unique response categories. If the variables
X and Y take on more than two values, or only one value, a message is printed and the test is not
done. The number of cases that have $X_i < Y_i$ ($n_1$) or $X_i > Y_i$ ($n_2$) are counted.


Test Statistic and Significance Level 


If $n_1 + n_2 \le 25$, the exact probability of r or fewer “successes” occurring in $n_1 + n_2$ trials when
p = 0.5 and $r = \min(n_1, n_2)$ is calculated recursively from the binomial.

$$P(X \le r) = \sum_{i=0}^{r} \binom{n_1 + n_2}{i} (0.5)^{n_1 + n_2}$$

The two-tailed probability level is obtained by doubling the computed value. If $n_1 + n_2 > 25$, a
$\chi^2$ approximation with a correction for continuity is used.

$$\chi^2 = \frac{\left(|n_1 - n_2| - 1\right)^2}{n_1 + n_2}, \quad df = 1$$
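
The two branches of McNemar's test can be sketched as below. The binomial sum is evaluated directly rather than recursively, the doubled exact probability is capped at 1 (an assumption, since doubling can exceed 1 when the two counts are equal), and the chi-square tail with one degree of freedom is obtained from `math.erfc`. All names are illustrative.

```python
import math
from typing import Tuple


def mcnemar(n1: int, n2: int) -> Tuple[str, float]:
    """Return (method used, two-tailed significance) for discordant counts."""
    n = n1 + n2
    if n <= 25:
        r = min(n1, n2)
        tail = sum(math.comb(n, i) for i in range(r + 1)) * 0.5 ** n
        return "exact", min(1.0, 2.0 * tail)  # cap at 1 is an assumption
    chi2 = (abs(n1 - n2) - 1) ** 2 / n
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    return "chi-square", math.erfc(math.sqrt(chi2 / 2.0))
```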


Sign Test 
Count of Signs 
For each case, the difference

$$D_i = X_i - Y_i$$

is computed and the number of positive ($n_p$) and negative ($n_n$) differences counted. Cases in
which $X_i = Y_i$ are ignored.


Test Statistic and Significance Level 


If $n_p + n_n \le 25$, the exact probability of r or fewer “successes” occurring in $n_p + n_n$ trials, when
p = 0.5 and $r = \min(n_p, n_n)$, is calculated recursively from the binomial

$$P(X \le r) = \sum_{i=0}^{r} \binom{n_p + n_n}{i} (0.5)^{n_p + n_n}$$

If $n_p + n_n > 25$, the significance level is based on the normal approximation

$$Z = \frac{\max(n_p, n_n) - 0.5(n_p + n_n) - 0.5}{0.5\sqrt{n_p + n_n}}$$


A two-tailed significance level is printed. 


Wilcoxon Matched-Pairs Signed-Rank Test 


Computation of Ranked Differences

For each case, the difference

$$D_i = X_i - Y_i$$

is computed, as well as the absolute value of $D_i$. All nonzero absolute differences are then sorted
into ascending order, and ranks are assigned. In the case of ties, the average rank is used. The
sums of the ranks corresponding to positive differences ($S_p$) and negative differences ($S_n$) are
calculated. The average positive rank is

$$\bar{X}_p = S_p / n_p$$

and the average negative rank is

$$\bar{X}_n = S_n / n_n$$

where $n_p$ is the number of cases with positive differences and $n_n$ the number with negative
differences.


Test Statistic and Significance Level 
The test statistic is

$$Z = \frac{\min(S_p, S_n) - n(n+1)/4}{\sqrt{n(n+1)(2n+1)/24 - \sum_{j=1}^{l} \left(t_j^3 - t_j\right)/48}}$$

where

Table 70-2
Notation

Notation    Description
$n$         Number of cases with non-zero differences
$l$         Number of ties
$t_j$       Number of elements in the j-th tie, $j = 1, \ldots, l$

For large sample sizes the distribution of Z is approximately standard normal. A two-tailed
probability level is printed.
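
The ranking and the tie-corrected Z statistic can be illustrated with the following sketch. The helper `average_ranks` and the function name `wilcoxon_signed_rank` are hypothetical; the code simply follows the formulas in this section.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Ranks in ascending order, with tied values given their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for m in range(i, j + 1):
            ranks[order[m]] = avg
        i = j + 1
    return ranks


def wilcoxon_signed_rank(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float]:
    """Return (S_p, S_n, Z) for paired samples x and y."""
    d = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    abs_d = [abs(v) for v in d]
    ranks = average_ranks(abs_d)
    s_p = sum(r for r, v in zip(ranks, d) if v > 0)
    s_n = sum(r for r, v in zip(ranks, d) if v < 0)
    n = len(d)
    tie_term = sum(t ** 3 - t for t in Counter(abs_d).values()) / 48
    z = (min(s_p, s_n) - n * (n + 1) / 4) / math.sqrt(
        n * (n + 1) * (2 * n + 1) / 24 - tie_term
    )
    return s_p, s_n, z
```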


Cochran’s Q Test 
Computation of Basic Statistics 


For each of the N cases, the k variables specified may take on only one of two possible values. If
more than two values, or only one, are encountered, a message is printed and the test is not done.
The first value encountered is designated a “success” and for each case the number of variables
that are “successes” is counted. The number of “successes” for case i will be designated $R_i$ and
the total number of “successes” for variable l will be designated $C_l$.


Test Statistic and Level of Significance 


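Cochran’s Q is calculated as

$$Q = \frac{(k - 1)\left[k \sum_{l=1}^{k} C_l^2 - \left(\sum_{l=1}^{k} C_l\right)^2\right]}{k \sum_{l=1}^{k} C_l - \sum_{i=1}^{N} R_i^2}$$

The significance level of Q is from the $\chi^2$ distribution with k - 1 degrees of freedom.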
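
A minimal sketch of the Q computation, assuming the standard form of Cochran's Q with the row totals $R_i$ in the denominator. The function name and the 0/1 coding of “success” are illustrative assumptions.

```python
from typing import Sequence, Tuple


def cochran_q(rows: Sequence[Sequence[int]]) -> Tuple[float, int]:
    """rows[i][l] is 1 if variable l is a success for case i, else 0.
    Returns (Q, degrees of freedom)."""
    k = len(rows[0])
    col_totals = [sum(r[l] for r in rows) for l in range(k)]  # C_l
    row_totals = [sum(r) for r in rows]                       # R_i
    numerator = (k - 1) * (k * sum(c * c for c in col_totals) - sum(col_totals) ** 2)
    denominator = k * sum(col_totals) - sum(r * r for r in row_totals)
    if denominator == 0:
        raise ValueError("Q is undefined when every case is all-success or all-failure")
    return numerator / denominator, k - 1
```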


Friedman’s Test 
Sum of Ranks 


For each of the N cases, the k variables are sorted and ranked, with average rank being assigned in
the case of ties. For each of the k variables, the sum of ranks over the cases is calculated. This
will be denoted as $C_l$. The average rank for each variable is

$$\bar{R}_l = C_l / N$$


Test Statistic and Significance Level 


The test statistic is

$$\chi^2 = \frac{\dfrac{12}{N k (k+1)} \sum_{l=1}^{k} C_l^2 - 3N(k+1)}{1 - \sum T / \left(N k (k^2 - 1)\right)}$$

where $\sum T$ is the same as in Kendall’s coefficient of concordance. See (Lehmann, 1985) p. 265.

The significance level is from the $\chi^2$ distribution with k - 1 degrees of freedom.


Kendall’s Coefficient of Concordance 
N, k, and l are the same as in Friedman, in the previous section.

Coefficient of Concordance (W)

$$W = \frac{F}{N(k-1)} = \frac{\sum_{l=1}^{k} \left(C_l - \dfrac{N(k+1)}{2}\right)^2}{N^2 k (k^2 - 1)/12 - N \sum T / 12}$$

where F = Friedman $\chi^2$ statistic.

$$\sum T = \sum \left(t^3 - t\right)$$

with t = number of variables tied at each tied rank for each case.

Test Statistic and Significance Level

$$\chi^2 = N(k-1)W$$

The significance level is from the $\chi^2$ distribution with k - 1 degrees of freedom.


The Two-Sample Median Test 
Table Construction 


If the median value is not specified by the user, the combined data from both samples are sorted
and the median calculated.

$$Md = \begin{cases} \left(X_{[N/2]} + X_{[N/2+1]}\right)/2 & \text{if } N \text{ is even} \\ X_{[(N+1)/2]} & \text{otherwise} \end{cases}$$

where $X_{[N]}$ is the largest value and $X_{[1]}$ the smallest. The number of cases in each of the two
groups which exceed the median are counted. These will be denoted as $g_1$ and $g_2$, and the
corresponding sample sizes as $n_1$ and $n_2$.


Test Statistic and Significance Level 
■ If $N \le 30$, the significance level is from Fisher’s exact test. (See Appendix 5.)

■ If $N > 30$, the test statistic is

$$\chi^2 = \frac{\left(\left|g_1(n_2 - g_2) - g_2(n_1 - g_1)\right| - N/2\right)^2 N}{(g_1 + g_2)(n_1 + n_2 - g_1 - g_2)\, n_1 n_2}$$

which is distributed as a $\chi^2$ with 1 degree of freedom.


Mann-Whitney U Test 


Calculation of Sums of Ranks 




The combined data from both groups are sorted and ranks assigned to all cases, with average rank
being used in the case of ties. The sum of ranks for each of the groups ($S_1$ and $S_2$) is calculated, as
well as, for tied observations, $T_i = (t^3 - t)/12$, where t is the number of observations tied for rank i.
The average rank for each group is

$$\bar{S}_i = S_i / n_i$$

where $n_i$ is the sample size in group i.


Test Statistic and Significance Level 


The U statistic for group 1 is

$$U = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - S_1$$

■ If $U > n_1 n_2/2$, the statistic used is

$$U' = n_1 n_2 - U$$

■ If $n_1 n_2 \le 400$ and $n_1 n_2/2 + \min(n_1, n_2) \le 220$, the exact significance level is based on an
  algorithm of Dineen and Blakesley (1973).

■ The test statistic corrected for ties is

$$Z = \frac{U - n_1 n_2/2}{\sqrt{\dfrac{n_1 n_2}{N(N-1)}\left(\dfrac{N^3 - N}{12} - \sum T_i\right)}}$$

which is distributed approximately as a standard normal. A two-tailed significance level is printed.


Wilcoxon Rank Sum W Statistic 


If $U > n_1 n_2/2$, then $W = S_1$; otherwise $W = S_2$.


Kolmogorov-Smirnov Two-Sample Test 
Calculation of the Empirical Cumulative Distribution Functions and Differences 


For each of the two groups separately, the data are sorted into ascending order, from $X_{(1)}$ to $X_{(n_i)}$, and
the empirical cdf for group i is computed as

$$F_i(X) = \begin{cases} 0 & -\infty < X < X_{(1)} \\ j/n_i & X_{(j)} \le X < X_{(j+1)}, \quad j = 1, \ldots, n_i - 1 \\ 1 & X_{(n_i)} \le X < \infty \end{cases}$$

For all of the $X_j$ values in the two groups, the difference between the two groups is

$$D_j = F_1(X_j) - F_2(X_j)$$

where $F_1(X_j)$ is the cdf for the group with the larger sample size. The maximum positive,
negative, and absolute differences are also computed.


Test Statistic and Level of Significance 


The test statistic (Smirnov, 1948) is

$$Z = \max_j |D_j| \sqrt{\frac{n_1 n_2}{n_1 + n_2}}$$

and the significance level is calculated using the Smirnov approximation described in the K-S
one-sample test.


Wald-Wolfowitz Runs Test 


Calculation of Number of Runs 


All observations from the two samples are pooled and sorted into ascending order. The number of 
changes in the group numbers corresponding to the ordered data are counted. The number of runs 
(R) is the number of group changes plus one. 


If there are ties involving observations from the two groups, both the minimum and maximum 
number of runs possible are calculated. 


Significance Level 


If $n_1 + n_2$, the total sample size, is less than or equal to 30, the one-sided significance level
is exactly calculated from

$$P(r \le R) = \frac{2}{\binom{n_1 + n_2}{n_1}} \sum_{r=2}^{R} \binom{n_1 - 1}{r/2 - 1} \binom{n_2 - 1}{r/2 - 1}$$

when R is even. When R is odd

$$P(r \le R) = \frac{1}{\binom{n_1 + n_2}{n_1}} \sum_{r=2}^{R} \left[\binom{n_1 - 1}{k - 1}\binom{n_2 - 1}{k - 2} + \binom{n_1 - 1}{k - 2}\binom{n_2 - 1}{k - 1}\right]$$

where

$$r = 2k - 1.$$

For sample sizes greater than 30, the normal approximation is used (see “Runs Test”).




Moses Test of Extreme Reaction 
Span Computation 
Observations from both groups are jointly sorted and ranked, with the average rank being assigned
in the case of ties. The ranks corresponding to the smallest and largest control group (first group)
members are determined, and the span is computed as

SPAN = Rank(Largest Control Value) - Rank(Smallest Control Value) + 1

rounded to the nearest integer.

Significance Level 
The exact one-tailed probability level is computed from

$$P(\text{SPAN} \le n_c - 2h + g) = \frac{\sum_{i=0}^{g} \binom{i + n_c - 2h - 2}{i} \binom{n_e + 2h + 1 - i}{n_e - i}}{\binom{n_c + n_e}{n_c}}$$

where h = 0, $n_c$ is the number of cases in the control group, and $n_e$ is the number of cases in the
experimental group. The same formula is used in the next section where h is not zero.


Censoring of Range 


The previous test is repeated, dropping the h lowest and h highest ranks from the control group. If
not specified by the user, h is taken to be the integer part of $0.05\, n_c$, or 1, whichever is greater. If
h is user specified, the integer value is used unless it is less than one. The significance level is
determined as in the previous section.
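
The exact probability can be sketched with `math.comb`. This is an illustration under stated assumptions: the function names are hypothetical, g is recovered from the observed span as SPAN - ($n_c$ - 2h), and the default for h assumes it is based on 5% of the control group size.

```python
import math


def moses_exact_p(span: int, n_c: int, n_e: int, h: int = 0) -> float:
    """One-tailed P(SPAN <= observed span) for the Moses test."""
    g = span - (n_c - 2 * h)
    if g < 0:
        raise ValueError("span is smaller than the minimum possible value")
    g = min(g, n_e)  # terms with i > n_e contribute nothing
    total = math.comb(n_c + n_e, n_c)
    numerator = sum(
        math.comb(i + n_c - 2 * h - 2, i) * math.comb(n_e + 2 * h + 1 - i, n_e - i)
        for i in range(g + 1)
    )
    return numerator / total


def default_h(n_c: int) -> int:
    """Assumed default: integer part of 0.05 * n_c, or 1, whichever is greater."""
    return max(1, int(0.05 * n_c))
```

With 5 controls and 4 experimental cases and h = 0, the smallest possible span (5) has probability 5/126, because the controls must occupy 5 consecutive ranks out of 9.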


K-Sample Median Test
Table Construction 


If the median value is not specified by the user, the combined data from all groups are sorted 
and the median is calculated. 


$$Md = \begin{cases} \left(X_{[N/2]} + X_{[N/2+1]}\right)/2 & \text{if } N \text{ is even} \\ X_{[(N+1)/2]} & \text{if } N \text{ is odd} \end{cases}$$

where $X_{[N]}$ is the largest value and $X_{[1]}$ the smallest.

The number of cases in each of the groups that exceed the median are counted and the following
table is formed.

          Group 1    Group 2    Group 3    ...    Group k
LE Md     $O_{11}$   $O_{12}$   $O_{13}$   ...    $O_{1k}$    $R_1$
GT Md     $O_{21}$   $O_{22}$   $O_{23}$   ...    $O_{2k}$    $R_2$
          $n_1$      $n_2$      $n_3$      ...    $n_k$       $N$


Test Statistic and Level of Significance 


The $\chi^2$ statistic for all nonempty groups is calculated as

$$\chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{k} \frac{\left(O_{ij} - E_{ij}\right)^2}{E_{ij}}, \qquad E_{ij} = \frac{R_i n_j}{N}$$

The significance level is from the $\chi^2$ distribution with k - 1 degrees of freedom, where k is the
number of nonempty groups. A message is printed if any cell has an expected value less than one,
or more than 20% of the cells have expected values less than five.


Kruskal-Wallis One-Way Analysis of Variance 
Computation of Sums of Ranks 


Observations from all k nonempty groups are jointly sorted and ranked, with the average rank
being assigned in the case of ties. The number of tied scores in a set of ties, $t_i$, is also found,
and the sum of $T_i = t_i^3 - t_i$ is accumulated. For each group the sum of ranks, $R_i$, as well as
the number of observations, $n_i$, is obtained.


Test Statistic and Level of Significance 


The test statistic unadjusted for ties is

$$H = \frac{12}{N(N+1)} \sum_{i=1}^{k} R_i^2 / n_i - 3(N+1)$$

where N is the total number of observations.
Adjusted for ties, the statistic is

$$H' = \frac{H}{1 - \sum_{i=1}^{m} T_i / \left(N^3 - N\right)}$$

where m is the total number of tied sets.

The significance level is based on the $\chi^2$ distribution, with k - 1 degrees of freedom.
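
The rank sums, the unadjusted statistic, and the tie adjustment can be sketched as follows. The function name and the list-of-lists input are illustrative assumptions; empty groups are skipped, in line with the reference to nonempty groups.

```python
from collections import Counter
from typing import Sequence, Tuple


def kruskal_wallis(groups: Sequence[Sequence[float]]) -> Tuple[float, float, int]:
    """Return (H unadjusted, H adjusted for ties, degrees of freedom)."""
    nonempty = [g for g in groups if len(g) > 0]
    pooled = [v for g in nonempty for v in g]
    n = len(pooled)
    counts = Counter(pooled)
    # Average rank for each distinct value in the joint ranking.
    rank_of = {}
    position = 0
    for value in sorted(counts):
        t = counts[value]
        rank_of[value] = position + (t + 1) / 2
        position += t
    rank_sums = [sum(rank_of[v] for v in g) for g in nonempty]
    h = 12 / (n * (n + 1)) * sum(r * r / len(g) for r, g in zip(rank_sums, nonempty)) - 3 * (n + 1)
    tie_sum = sum(t ** 3 - t for t in counts.values())  # sum of T_i
    h_adjusted = h / (1 - tie_sum / (n ** 3 - n))
    return h, h_adjusted, len(nonempty) - 1
```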
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For post hoc range tests and pairwise multiple comparisons, see Post Hoc Tests. 


Notation 


The following notation is used throughout this section unless otherwise stated:

Table 71-1
Notation

Notation    Description
$X_{lj}$    Value of the jth observation in group l
$w_{lj}$    Weight for the jth observation in group l
$W_{lj}$    Sum of weights of the first j cases in group l
$W_l$       Sum of weights of all cases in group l
$k$         Number of groups, determined as maximum group value minus minimum plus one
$k'$        Number of nonempty groups
$n_l$       Number of cases in group l
$W$         Sum of weights of cases in all groups


Group Statistics 


The following group statistics are available. 


Computation of Group Statistics 


A weighted version of the Young-Cramer (1971) algorithm is used to compute recursively the
corrected sum of squares for each group.

$$SS_{lj} = SS_{l,j-1} + \frac{w_{lj}\left(X_{lj} W_{l,j-1} - \sum_{r=1}^{j-1} w_{lr} X_{lr}\right)^2}{W_{lj}\, W_{l,j-1}}$$

The initial value is 0; the value for each group after the last observation has been processed is the
corrected sum of squares.

$$SS_l = SS_{l n_l}$$

The sum and mean for each group are

$$T_l = \sum_{j=1}^{n_l} w_{lj} X_{lj}$$

$$\bar{T}_l = T_l / W_l$$

The variance is

$$S_l^2 = SS_l / (W_l - 1)$$

The grand sum is

$$G = \sum_{l=1}^{k} T_l$$
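
A single-pass sketch of the weighted update for one group follows. The function name is hypothetical, and the update is written in the usual weighted form $w_j (x_j W_{j-1} - T_{j-1})^2 / (W_j W_{j-1})$, which is assumed to be what the recursion above expresses.

```python
from typing import Sequence, Tuple


def weighted_group_stats(x: Sequence[float], w: Sequence[float]) -> Tuple[float, float, float, float, float]:
    """Return (W, T, mean, corrected SS, variance) for one group."""
    sum_w = 0.0   # running sum of weights
    total = 0.0   # running weighted sum of values
    ss = 0.0      # running corrected sum of squares
    for xj, wj in zip(x, w):
        new_sum_w = sum_w + wj
        if sum_w > 0:  # the first observation contributes nothing
            ss += wj * (xj * sum_w - total) ** 2 / (new_sum_w * sum_w)
        sum_w = new_sum_w
        total += wj * xj
    mean = total / sum_w
    variance = ss / (sum_w - 1)
    return sum_w, total, mean, ss, variance
```

Updating the sum of squares this way avoids the cancellation error of the "sum of squares minus squared sum" shortcut while still needing only one pass over the data.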


Group Statistics from Summary Statistics 


With matrix data input, the user supplies sum of weights in each group ($W_l$), means ($\bar{T}_l$), and
standard deviations ($S_l$). From these,

$$T_l = W_l \bar{T}_l$$

If the user supplies the pooled variance $S_p^2$ and its degrees of freedom (D) instead of the individual
$S_l$, and D < 1, the program will reset it to

$$D = \sum_{l=1}^{k} W_l - k$$

The within-group sum of squares is

$$WSS = S_p^2 D$$


The ANOVA Table 
Table 71-2
ANOVA table

Source of Variation    SS                                                    df
Between (BSS)          $\sum_{l=1}^{k} T_l^2 / W_l - G^2 / W$                $k' - 1$
Within (WSS)           $\sum_{l=1}^{k} SS_l$ ($S_p^2 D$ for matrix input)    $W - k'$ (D)
Total (TSS)            BSS + WSS                                             $W - 1$

Mean squares are calculated by dividing each sum of squares by its degrees of freedom. The
F ratio for testing equality of group means is

$$F = \frac{\text{Mean Square Between}}{\text{Mean Square Within}} = \frac{BSSM}{WSSM}$$

The significance level is obtained from the F distribution with numerator and denominator
degrees of freedom.


Basic Statistics 


The following basic statistics are available.

Descriptive Statistics

Sample size = $W_l$

Mean = $\bar{T}_l$

Standard deviation = $S_l$

Standard error = $S_l / \sqrt{W_l}$

95% Confidence Interval for the Mean

$$\bar{T}_l \pm t_{W_l - 1}\, S_l / \sqrt{W_l}$$

where $t_{W_l - 1}$ is the upper 2.5% critical value for the t distribution with $W_l - 1$ degrees of freedom.


Variance Estimates and Confidence Interval for Mean 


Computation depends upon whether a fixed-effects or random-effects model is fit.

Fixed-Effects Model

Fixed-effects factors are generally thought of as variables whose values of interest are all
represented in the data file.

Pooled Standard Deviation

$$S_p = \sqrt{WSSM}$$

Standard Error

$$\text{Standard error} = \sqrt{WSSM / W}$$

95% Confidence Interval for the Mean

$$\bar{G} \pm t_{W-k} \sqrt{WSSM / W}$$

where $\bar{G} = G/W$ is the grand mean and $t_{W-k}$ is the upper 2.5% critical value for the t distribution with W - k degrees of freedom.


Random-Effects Model 


Random-effects factors are variables whose values in the data file can be considered a random 
sample from a larger population of values. They are useful for explaining excess variability 
in the dependent variable. 


Between-Groups Component of Variance (Snedecor and Cochran 1967)

$$\hat{\sigma}_b^2 = \frac{(BSSM - WSSM)\, W (k - 1)}{W^2 - \sum_{l=1}^{k} W_l^2}$$

Standard Error of the Mean (Brownlee 1965)

$$V(\bar{G}) = \frac{1}{W}\left[WSSM + \frac{\sum_{l=1}^{k} W_l^2\, (k - 1)(BSSM - WSSM)}{W^2 - \sum_{l=1}^{k} W_l^2}\right]$$

If BSSM < WSSM, $V(\bar{G}) = WSSM / W$ and a warning is printed that the variance component
estimate is negative.

95% Confidence Interval for the Mean

$$\bar{G} \pm t_{k-1} \sqrt{V(\bar{G})}$$

where $t_{k-1}$ is the upper 2.5% critical value for the t distribution with k - 1 degrees of freedom.


Levene Test for Homogeneity of Variances 


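$$L = \frac{(W - k) \sum_{l=1}^{k} W_l \left(\bar{Z}_l - \bar{Z}\right)^2}{(k - 1) \sum_{l=1}^{k} \sum_{i=1}^{n_l} w_{li} \left(Z_{li} - \bar{Z}_l\right)^2}$$

where

$$Z_{li} = \left|X_{li} - \bar{T}_l\right|$$

$$\bar{Z}_l = \frac{\sum_{i=1}^{n_l} w_{li} Z_{li}}{W_l}$$

$$\bar{Z} = \frac{\sum_{l=1}^{k} \sum_{i=1}^{n_l} w_{li} Z_{li}}{W}$$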

User-Supplied Contrasts 


Let $C_1$ through $C_k$ be the coefficients for a particular contrast. If the sum of the coefficients is
not 0, a warning is printed and the contrast number is starred. For each contrast the following
are printed.

Value of the Contrast

$$V = \sum_{l=1}^{k} C_l \bar{T}_l$$


Pooled Variance Statistics 


The following statistics are computed.

Standard Error

$$SE = \sqrt{S_p^2 \sum_{l=1}^{k} C_l^2 / W_l}$$

t Value

$$t = V / SE$$

Degrees of Freedom

$$W - k'$$

and a two-tailed significance level based on the t distribution with $W - k'$ degrees of freedom.

Separate Variance Statistics 


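The following statistics are computed.

Standard Error

$$SE = \sqrt{\sum_{l=1}^{k} C_l^2 S_l^2 / W_l}$$

t Value

$$t = V / SE$$

Degrees of Freedom (Brownlee 1965)

$$df = \frac{\left(\sum_{l=1}^{k} C_l^2 S_l^2 / W_l\right)^2}{\sum_{l=1}^{k} \dfrac{\left(C_l^2 S_l^2 / W_l\right)^2}{W_l - 1}}$$

and a two-tailed significance level based on the t distribution with df degrees of freedom.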


Polynomial Contrasts (Speed 1976) 


If the specified degree of the polynomial (NP) is less than or equal to 0, or greater than 5, a
message is printed and the procedure is terminated. If the degree of the polynomial specified is
greater than the number of nonempty groups, it is set to k - 1. If the sums of the weights in each
group are equal, only the WEIGHTED contrasts will be generated. For unequal sample sizes with
equal spacing between groups, both WEIGHTED and UNWEIGHTED contrasts are computed.
For unequal sample sizes and unequal spacing, only WEIGHTED contrasts are computed. The
metric for the polynomial is the group code.


UNWEIGHTED Contrasts and Statistics 


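The coefficients for the orthogonal polynomial are calculated recursively from the following
relations:

$$c_{i,q} = (i - A_q)\, c_{i,q-1} - C_q\, c_{i,q-2}$$

for $i = 1, \ldots, k$ and $q \ge 1$, with the initial values

$$c_{i,-1} = 0, \quad c_{i,0} = 1$$

and

$$A_q = \frac{\sum_{i=1}^{k} i\, c_{i,q-1}^2}{\sum_{i=1}^{k} c_{i,q-1}^2}$$

$$C_q = \frac{\sum_{i=1}^{k} i\, c_{i,q-1}\, c_{i,q-2}}{\sum_{i=1}^{k} c_{i,q-2}^2} \quad \text{for } q \ge 2$$

$$C_q = 0 \quad \text{for } q = 1$$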

The F statistic for the qth degree contrast is computed as

$$F = \frac{\left(\sum_{i=1}^{k} c_{i,q} \bar{T}_i\right)^2}{WSSM \sum_{i=1}^{k} c_{i,q}^2 / W_i}$$

where WSSM is the mean square within. The significance level is obtained from the F distribution
with 1 and $W - k'$ degrees of freedom.


WEIGHTED Contrasts and Statistics (Emerson 1968; Robson 1959) 


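The contrast for the qth degree polynomial component is computed from the following recursive
relations:

$$d_{i,q} = (i - A_q)\, d_{i,q-1} - C_q\, d_{i,q-2}$$

with initial values

$$d_{i,0} = 1, \quad d_{i,-1} = 0$$

$$A_q = \frac{\sum_{i=1}^{k} i\, W_i\, d_{i,q-1}^2}{\sum_{i=1}^{k} W_i\, d_{i,q-1}^2}$$

$$C_q = \frac{\sum_{i=1}^{k} i\, W_i\, d_{i,q-1}\, d_{i,q-2}}{\sum_{i=1}^{k} W_i\, d_{i,q-2}^2} \quad \text{for } q \ge 2$$

$$C_q = 0 \quad \text{for } q = 1$$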


The test for the contribution of the qth degree orthogonal polynomial component is based on

$$F = D_q / WSSM$$

where

$$D_q = \frac{\left(\sum_{i=1}^{k} W_i\, d_{i,q}\, \bar{T}_i\right)^2}{\sum_{i=1}^{k} W_i\, d_{i,q}^2}$$

The significance level is computed from the F distribution with degrees of freedom 1 and $W - k'$.
The test for deviation from the qth degree polynomial is based on

$$F = DD_q / WSSM$$

where

$$DD_q = \left(BSS - \sum_{j=1}^{q} D_j\right) / \left(k' - q - 1\right)$$

The significance level is computed from the F distribution with degrees of freedom $k' - q - 1$ and
$W - k'$. The highest degree printed will be the minimum of $(k' - 2)$ and 5.

Multiple Comparisons (Winer 1971) 


Generation of Ranges 


The Student-Newman-Keuls (SNK), TUKEY, and TUKEYB procedures are all based on the
studentized range, $S_{r,f}$, where r is the number of steps between means and f is the degrees of
freedom for the within-groups mean square. For the above tests, only $\alpha = 0.05$ can be used.

The appropriate range of values for the tests are

SNK.       $R_r = S_{r,f}$

TUKEY.     $R_r = S_{k,f}$

TUKEYB.    $R_r = \dfrac{S_{r,f} + S_{k,f}}{2}$

For the DUNCAN procedure, alphas of 0.01, 0.05, and 0.10 can be used. The ranges ($D_{r,f}$) are
generated using the algorithm of Gebhardt (1966).

DUNCAN.    $R_r = D_{r,f}, \quad r = 2, \ldots, k$

The Scheffé, LSD, and modified LSD procedures all use critical points from the F distribution.
Any $\alpha \le 0.5$ can be used.

SCHEFFE.   $R_r = \sqrt{2(k' - 1)\, F_{1-\alpha}(k' - 1, f)}$

LSD.       $R_r = \sqrt{2 F_{1-\alpha}(1, f)}$

MODLSD.    $R_r = \sqrt{2 F_{1-\alpha'}(1, f)}$


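where

$$M_{ij} = S_p \sqrt{\frac{1}{2}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)} \quad \text{(default)}$$

or the same expression with $n_i$ and $n_j$ both replaced by the harmonic mean of the group sizes (harmonic mean for all groups)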


Establishment of Homogeneous Subsets 


If the sample sizes in all groups are equal, or the harmonic mean for all groups has been selected,
or the multiple comparison procedure is SNK or DUNCAN, homogeneous subsets are established
as follows:

The means are sorted into ascending order from $\bar{T}_{(1)}$ to $\bar{T}_{(k')}$. Values of i and q such that

$$\left|\bar{T}_{(q)} - \bar{T}_{(i)}\right| \le R_{q-i+1}\, M_{q,i} \qquad (*)$$

are systematically searched for and

$$\left\{\bar{T}_{(i)}, \ldots, \bar{T}_{(q)}\right\}$$

is considered a homogeneous subset. The search procedure is as follows:

At each step t, the value of i is incremented by 1 (the starting value is 1), and q = k. The value of
q is then decremented by one until (*) is true. Call this value $q_t$. If $q_t > q_{t-1}$ and (*) is true,

$$\left\{\bar{T}_{(i)}, \ldots, \bar{T}_{(q_t)}\right\}$$

is considered homogeneous. Otherwise i is incremented and the next step is done. The procedure
terminates when i = k or $q_t = k$.

In all other situations, all nonredundant pairs of groups are compared using the criteria of (*).
A table containing all pairs of groups is printed with symbols indicating group means that are
significantly different.


Welch Test 


Welch (1947, 1951) derived an approximate test for equality of means without the
homogeneous variance assumption. The statistic is given by

$$F_{Welch} = \frac{\sum_{l=1}^{k} w_l \left(\bar{T}_l - \tilde{X}\right)^2 / (k - 1)}{1 + \dfrac{2(k - 2)}{k^2 - 1} \sum_{l=1}^{k} \left(1 - w_l / u\right)^2 / (W_l - 1)}$$

where $w_l = W_l / S_l^2$, $u = \sum_{l=1}^{k} w_l$, and $\tilde{X} = \sum_{l=1}^{k} w_l \bar{T}_l / u$.

The Welch statistic has an approximate F distribution with k - 1 and f degrees of freedom, where

$$f = \left[\frac{3}{k^2 - 1} \sum_{l=1}^{k} \left(1 - w_l / u\right)^2 / (W_l - 1)\right]^{-1}$$

Since the weight used in the Welch statistic is $w_l = W_l / S_l^2$, one cannot compute the statistic if any
one group has zero standard deviation. Moreover, sample sizes of all groups have to be greater
than or equal to zero.
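
The Welch statistic and its denominator degrees of freedom can be computed from group summaries alone. The sketch below assumes the standard form of the Welch test with weights $w_l = W_l / S_l^2$; the function name and argument order are illustrative.

```python
from typing import Sequence, Tuple


def welch_anova(
    sizes: Sequence[float],   # W_l, sum of weights (sample size) per group
    means: Sequence[float],   # group means
    sds: Sequence[float],     # group standard deviations, all nonzero
) -> Tuple[float, int, float]:
    """Return (F_Welch, numerator df, denominator df f)."""
    k = len(sizes)
    if any(s == 0 for s in sds):
        raise ValueError("Welch statistic is undefined for zero standard deviation")
    w = [n / s ** 2 for n, s in zip(sizes, sds)]
    u = sum(w)
    x_tilde = sum(wl * m for wl, m in zip(w, means)) / u
    lam = sum((1 - wl / u) ** 2 / (n - 1) for wl, n in zip(w, sizes))
    numerator = sum(wl * (m - x_tilde) ** 2 for wl, m in zip(w, means)) / (k - 1)
    denominator = 1 + 2 * (k - 2) / (k ** 2 - 1) * lam
    f = (k ** 2 - 1) / (3 * lam)
    return numerator / denominator, k - 1, f
```

For three groups of size 10 with unit standard deviations and means 1, 2, 3, the sketch gives F = 270/28 with 2 and 18 degrees of freedom.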


Brown-Forsythe Test 


In (Brown and Forsythe, 1974a) and (Brown and Forsythe, 1974b), a test statistic for equal means
was proposed. The statistic has the following form,

$$F_{BF} = \frac{\sum_{l=1}^{k} W_l \left(\bar{T}_l - \bar{G}\right)^2}{\sum_{l=1}^{k} \left(1 - W_l / W\right) S_l^2}$$

The statistic has an approximate F distribution with (k - 1) and f degrees of freedom, where

$$\frac{1}{f} = \sum_{l=1}^{k} \frac{c_l^2}{W_l - 1}$$

and

$$c_l = \frac{\left(1 - W_l / W\right) S_l^2}{\sum_{l=1}^{k} \left(1 - W_l / W\right) S_l^2}$$

The denominator of $F_{BF}$ is an attempt to estimate the 'pooled variance'
by

$$S_p^2 = \sum_{l=1}^{k} \tilde{w}_l S_l^2$$

where

$$\tilde{w}_l = \frac{W - W_l}{W (k - 1)}$$

The Brown & Forsythe statistic cannot be computed if all groups have zero standard deviation or
any group has sample size less than or equal to 1. In the situation that some groups have zero
standard deviations, the statistic can be computed but the approximation may not work.

OPTIMAL BINNING Algorithms


The Optimal Binning procedure performs MDLP (minimal description length principle) 
discretization of scale variables. This method divides a scale variable into a small number of 
intervals, or bins, where each bin is mapped to a separate category of the discretized variable. 
MDLP is a univariate, supervised discretization method. Without loss of generality, the 
algorithm described in this document only considers one continuous attribute in relation to a 
categorical guide variable — the discretization is “optimal” with respect to the categorical guide. 
Therefore, the input data matrix S contains two columns, the scale variable A and categorical 


guide C. 


Notation 


The following notation is used throughout this chapter unless otherwise stated:

Notation         Description
S                The input data matrix, containing a column of the scale variable A and a
                 column of the categorical guide C. Each row is a separate observation, or
                 instance.
A                A scale variable, also called a continuous attribute.
S(i)             The value of A for the ith instance in S.
N                The number of instances in S.
D                A set of all distinct values in S.
$S_i$            A subset of S.
C                The categorical guide, or class attribute; it is assumed to have k
                 categories, or classes.
T                A cut point that defines the boundary between two bins.
$T_A$            A set of cut points.
Ent(S)           The class entropy of S.
E(A, T; S)       The class entropy of the partition induced by T on A.
Gain(A, T; S)    The information gain of the cut point T on A.
n                A parameter denoting the number of cut points for the equal frequency
                 method.
W                A weight attribute denoting the frequency of each instance. If the weight
                 values are not integer, they are rounded to the nearest whole numbers before
                 use. For example, 0.5 is rounded to 1, and 2.4 is rounded to 2. Instances
                 with missing weights or weights less than 0.5 are not used.


Simple MDLP


This section describes the supervised binning method (MDLP) discussed in Fayyad and Irani
(1993).


Class Entropy 


Let there be k classes $C_1, \ldots, C_k$ and let $P(C_i, S)$ be the proportion of instances in S that have
class $C_i$. The class entropy Ent(S) is defined as

$$Ent(S) = -\sum_{i=1}^{k} P(C_i, S) \log_2\left(P(C_i, S)\right)$$

Class Information Entropy

For an instance set S, a continuous attribute A, and a cut point T, let $S_1 \subset S$ be the subset of
instances in S with the values of $A \le T$, and $S_2 = S - S_1$. The class information entropy of the
partition induced by T, E(A, T; S), is defined as

$$E(A, T; S) = \frac{|S_1|}{|S|} Ent(S_1) + \frac{|S_2|}{|S|} Ent(S_2)$$


Information Gain 


Given a set of instances S, a continuous attribute A, and a cut point T on A, the information 
gain of a cut point T is 


Gain (A,T;S) = Ent (S) — E(A,T;S) 


MDLP Acceptance Criterion 


The partition induced by a cut point T for a set S of N instances is accepted if and only if

$$Gain(A, T; S) > \frac{\log_2(N - 1)}{N} + \frac{\Delta(A, T; S)}{N}$$

and it is rejected otherwise.

Here $\Delta(A, T; S) = \log_2\left(3^k - 2\right) - \left[k \cdot Ent(S) - k_1 Ent(S_1) - k_2 Ent(S_2)\right]$, in which $k_i$ is the
number of classes in the subset $S_i$ of S.
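
The entropy, gain, and acceptance test can be sketched as follows. The function names are hypothetical, the partition entropy is assumed to be the size-weighted average of the two subset entropies (as in Fayyad and Irani), and the data in the usage note are made up for illustration rather than taken from this document.

```python
import math
from collections import Counter
from typing import Sequence, Tuple


def class_entropy(labels: Sequence[str]) -> float:
    """Ent(S): entropy (base 2) of the class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def mdlp_criterion(left: Sequence[str], right: Sequence[str]) -> Tuple[float, float, bool]:
    """Return (gain, threshold, accepted) for the split S -> (S1, S2)."""
    whole = list(left) + list(right)
    n = len(whole)
    partition_entropy = (len(left) * class_entropy(left) + len(right) * class_entropy(right)) / n
    gain = class_entropy(whole) - partition_entropy
    k, k1, k2 = len(set(whole)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (
        k * class_entropy(whole) - k1 * class_entropy(left) - k2 * class_entropy(right)
    )
    threshold = (math.log2(n - 1) + delta) / n
    return gain, threshold, gain > threshold
```

With a hypothetical set of seven instances, one of class "a" on the left and six of class "b" on the right, the split separates the classes perfectly, yet the gain (about 0.5917) stays below the threshold and the cut is rejected. Ten of each class, perfectly separated, is accepted.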


Note: While the MDLP acceptance criterion uses the association between A and C to determine 
cut points, it also tries to keep the creation of bins to a small number. Thus there are situations in 
which a high association between A and C will result in no cut points. For example, consider the 
following data: 


Then the potential cut point is T = 1. In this case:

$$\mathrm{Gain}(A, T; S) = 0.5916728$$

$$\frac{\log_2(N-1)}{N} + \frac{\Delta(A, T; S)}{N} = 0.6530774$$

Since 0.5916728 < 0.6530774, T is not accepted as a cut point, even though there is a clear relationship between A and C.


Algorithm: BinaryDiscretization 


1. Calculate $E(A, d_i; S)$ for each distinct value $d_i \in D$ for which $d_i$ and $d_{i+1}$ do not belong to the same class. A distinct value belongs to a class if all instances of this value have the same class.

2. Select a cut point T for which E(A, T; S) is minimum among all the candidate cut points, that is,

$$T = \arg\min_{d_i} E(A, d_i; S)$$


Algorithm: MDLPCut 


1. BinaryDiscretization(A, T; D, S).

2. Calculate Gain(A, T; S).

3. If $\mathrm{Gain}(A, T; S) > \frac{\log_2(N-1)}{N} + \frac{\Delta(A, T; S)}{N}$ then

a) $T_A = T_A \cup T$
b) Split D into $D_1$ and $D_2$, and S into $S_1$ and $S_2$.
c) MDLPCut(A, $T_A$; $D_1$, $S_1$).
d) MDLPCut(A, $T_A$; $D_2$, $S_2$).

where $S_1 \subset S$ is the subset of instances in S with A-values $\le T$, and $S_2 = S - S_1$. $D_1$ and $D_2$ are the sets of all distinct values in $S_1$ and $S_2$, respectively.


Also presented is the iterative version of MDLPCut(A, $T_A$; D, S). The iterative implementation requires a stack to store the D and S remaining to be cut.

First push D and S onto the stack. Then, while the stack is not empty, do:

1. Obtain D and S by popping the stack.

2. BinaryDiscretization(A, T; D, S).

3. Calculate Gain(A, T; S).

4. If $\mathrm{Gain}(A, T; S) > \frac{\log_2(N-1)}{N} + \frac{\Delta(A, T; S)}{N}$ then

i) $T_A = T_A \cup T$

ii) Split D into $D_1$ and $D_2$, and S into $S_1$ and $S_2$.

iii) Push $D_1$ and $S_1$ onto the stack.

iv) Push $D_2$ and $S_2$ onto the stack.

Note: In practice, all operations within the algorithm are based on a global matrix M. Its element, $m_{ij}$, denotes the total number of instances that have value $d_i \in D$ and belong to the jth class in S. In addition, D is sorted in ascending order. Therefore, we do not need to push D and S onto the stack, but only two integer numbers, which denote the bounds of D.


Algorithm: SimpleMDLP 


1. Sort the set S with N instances by the value A in ascending order.
2. Find a set of all distinct values, D, in S.
3. $T_A = \emptyset$.
4. MDLPCut(A, $T_A$; D, S)
5. Sort the set $T_A$ in ascending order, and output $T_A$.
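The steps above, combined with the iterative (stack-based) form of MDLPCut, can be sketched as follows. This is a minimal illustration under stated assumptions: all function names are hypothetical, E(A, T; S) is taken as the size-weighted average of the subset entropies, and for brevity every distinct value except the largest is tried as a candidate rather than only the class-boundary values, which costs more work than the procedure described above.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def split(pairs, cut):
    """Class labels of S1 (A <= T) and S2 (A > T)."""
    left = [c for v, c in pairs if v <= cut]
    right = [c for v, c in pairs if v > cut]
    return left, right

def best_cut(pairs):
    """BinaryDiscretization: (T, E) minimising E(A, T; S), or None."""
    n = len(pairs)
    best = None
    for d in sorted({v for v, _ in pairs})[:-1]:  # largest value cannot split S
        left, right = split(pairs, d)
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if best is None or e < best[1]:
            best = (d, e)
    return best

def accepted(pairs, cut, e):
    """MDLP acceptance criterion for the cut point T on the set S."""
    n = len(pairs)
    labels = [c for _, c in pairs]
    left, right = split(pairs, cut)
    ent_s = entropy(labels)
    k, k1, k2 = (len(set(x)) for x in (labels, left, right))
    delta = math.log2(3 ** k - 2) - (
        k * ent_s - k1 * entropy(left) - k2 * entropy(right))
    return ent_s - e > (math.log2(n - 1) + delta) / n

def simple_mdlp(values, labels):
    """Return the sorted list of accepted cut points T_A."""
    cuts = []
    stack = [sorted(zip(values, labels), key=lambda p: p[0])]
    while stack:
        pairs = stack.pop()
        found = best_cut(pairs)
        if found is None:
            continue
        cut, e = found
        if accepted(pairs, cut, e):
            cuts.append(cut)
            stack.append([p for p in pairs if p[0] <= cut])
            stack.append([p for p in pairs if p[0] > cut])
    return sorted(cuts)

# Three well-separated classes of four instances each.
cuts = simple_mdlp(list(range(1, 13)), ["a"] * 4 + ["b"] * 4 + ["c"] * 4)
```

The sketch pushes whole subsets onto the stack for clarity; as the note on the iterative version says, pushing only the bounds of D is enough when D is sorted.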
Hybrid MDLP 


When the set D of distinct values in S is large, the computational cost to calculate $E(A, d_i; S)$ for each $d_i \in D$ is large. In order to reduce the computational cost, the unsupervised equal frequency binning method is used to reduce the size of D and obtain a subset $D_{ef} \subset D$. Then the MDLPCut(A, $T_A$; $D_{ef}$, S) algorithm is applied to obtain the final cut point set $T_A$.


Algorithm: EqualFrequency 


It divides a continuous attribute A into n bins where each bin contains N/n instances. n is a user-specified parameter, where 1 < n < N.

1. Sort the set S with N instances by the value A in ascending order.

2. $D_{ef} = \emptyset$

3. i = 1.

4. Use the aempirical percentile method to generate the $d_{p,i}$ which denote the $(\frac{i}{n} \times 100)$th percentiles.

5. $D_{ef} = D_{ef} \cup \{d_{p,i}\}$; i = i + 1

6. If i < n, then go to step 4.

7. Delete the duplicate values in the set $D_{ef}$.


Note: If, for example, there are many occurrences of a single value of A, the equal frequency 
criterion may not be met. In this case, no cut points are produced. 


Algorithm: HybridMDLP 


1. $D = \emptyset$;

2. EqualFrequency(A, n, D; S).

3. $T_A = \emptyset$;

4. MDLPCut(A, $T_A$; D, S).

5. Output $T_A$.


Model Entropy


The model entropy is a measure of the predictive accuracy of an attribute A binned on the class variable C. Given a set of instances S, suppose that A is discretized into I bins given C, where the ith bin has the value $A_i$. Letting $S_i \subset S$ be the subset of instances in S with the value $A_i$, the model entropy is defined as:

$$E_m = \sum_{i=1}^{I} P(A_i) \left( -\sum_{j=1}^{J} P(C_j \mid A_i) \log_2 P(C_j \mid A_i) \right)$$

where $P(A_i) = \frac{|S_i|}{|S|}$ and $P(C_j \mid A_i) = \frac{P(C_j \cap A_i)}{P(A_i)} = P(C_j, S_i)$.
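As a rough illustration of the definition, the following Python sketch computes the model entropy for a given set of cut points. The function name `model_entropy` is hypothetical, and bins are assumed to be formed by placing values less than or equal to a cut point in the lower bin, matching the definition of $S_1$ used for cut points earlier.

```python
import math
from collections import Counter

def model_entropy(values, labels, cuts):
    """Size-weighted average of the class entropies of the bins."""
    cuts = sorted(cuts)
    bins = {}
    for v, c in zip(values, labels):
        index = sum(v > t for t in cuts)  # number of cut points below v
        bins.setdefault(index, []).append(c)
    n = len(labels)
    total = 0.0
    for members in bins.values():
        size = len(members)
        h = sum(-(m / size) * math.log2(m / size)
                for m in Counter(members).values())
        total += (size / n) * h           # P(A_i) times the entropy of bin i
    return total

perfect = model_entropy([1, 2, 3, 4], ["a", "a", "b", "b"], cuts=[2])
unbinned = model_entropy([1, 2, 3, 4], ["a", "a", "b", "b"], cuts=[])
```

With a cut that separates the two classes the value is 0; with no cuts it equals the class entropy of the whole set, here 1 bit.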


Merging Sparsely Populated Bins 


Occasionally, the procedure may produce bins with very few cases. The following strategy deletes 
these pseudo cut points: 


For a given variable, suppose that the algorithm found $n_{final}$ cut points, and thus $n_{final}+1$ bins. For bins $i = 2, \ldots, n_{final}$ (the second lowest-valued bin through the second highest-valued bin), compute

$$\frac{\mathrm{sizeof}(b_i)}{\min(\mathrm{sizeof}(b_{i-1}), \mathrm{sizeof}(b_{i+1}))}$$

where sizeof(bin) is the number of cases in the bin.

When this value is less than a user-specified merging threshold, $b_i$ is considered sparsely populated and is merged with $b_{i-1}$ or $b_{i+1}$, whichever has the lower class information entropy. For more information, see the topic “Class Information Entropy”.


The procedure makes a single pass through the bins. 


Example 


The following example shows the process of simple MDLP using an artificial data set S with 250 
instances. S is not shown here, but can be reconstructed (sorted in ascending order of values of A) 
from the matrix M below. 


First, sort S by the value of A in ascending order. Then find a set, D, of all distinct values in S. 


|D| = 46.

D = {-2.6, -2.4, -2.1, -2, -1.9, -1.8, -1.7, -1.6, -1.5, -1.4, -1.3, -1.2, -1.1, -1, -0.9, -0.8, -0.7, -0.6,
-0.5, -0.4, -0.3, -0.2, -0.1, 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 1.1, 1.2, 1.3, 1.4, 1.5,
1.6, 1.7, 1.8, 1.9, 2, 2.1, 2.3}


Compute the frequencies of instances with respect to each class for each distinct value $d_i \in D$ and construct a matrix M. Its element, $m_{ij}$, denotes the total number of instances that have value $d_i$ and belong to the jth class.


Table 72-1
2-Dimensional matrix M

MDLPCut(A, $T_A$; D, S)


Calculate $E(A, d_i; S)$ for each $d_i \in D$ for which $d_i$ and $d_{i+1}$ do not belong to the same class.


E(A, d_i; D, S)    1.4742    0.5955    ...9038


$T_A$ = {-0.1}


$D_1$ = {-2.6, -2.4, -2.1, -2, -1.9, -1.8, -1.7, -1.6, -1.5, -1.4, -1.3, -1.2, -1.1, -1, -0.9, -0.8, -0.7, -0.6,
-0.5, -0.4, -0.3, -0.2, -0.1}


$S_1$ = {all instances with A-values ≤ -0.1}

$D_2$ = {0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9,
2, 2.1, 2.3}


$S_2$ = {all instances with A-values > -0.1}


Calculate $E(A, d_i; S_1)$ for each $d_i \in D_1$ for which $d_i$ and $d_{i+1}$ do not belong to the same class.


$T_A$ = {-0.1, -2.1}


$D_{1,1}$ = {-2.6, -2.4, -2.1}


$S_{1,1}$ = {all instances with A-values between -2.6 and -2.1}


$D_{1,2}$ = {-2, -1.9, -1.8, -1.7, -1.6, -1.5, -1.4, -1.3, -1.2, -1.1, -1, -0.9, -0.8, -0.7, -0.6, -0.5, -0.4,
-0.3, -0.2, -0.1}


$S_{1,2}$ = {all instances with A-values between -2 and -0.1}

All instances in $S_{1,1}$ belong to the same class, thus $S_{1,1}$ can’t be split further.
All instances in $S_{1,2}$ belong to the same class, thus $S_{1,2}$ can’t be split further.


Calculate $E(A, d_i; S_2)$ for each $d_i \in D_2$ for which $d_i$ and $d_{i+1}$ do not belong to the same class.


$T_A$ = {-0.1, -2.1, 0.9}


$D_{2,1}$ = {0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}

$S_{2,1}$ = {all instances with A-values between 0 and 0.9}

$D_{2,2}$ = {1, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2, 2.1, 2.3}

$S_{2,2}$ = {all instances with A-values between 1 and 2.3}

All instances in $S_{2,1}$ belong to the same class, thus $S_{2,1}$ can’t be split further.

All instances in $S_{2,2}$ belong to the same class, thus $S_{2,2}$ can’t be split further.


ORTHOPLAN Algorithms


This procedure generates an orthogonal main-effects design. It will find the smallest orthogonal 
plan to fit the factors having at least as many combinations as requested. 


Selecting the Plan 


From a library of prepared plans, select the shortest plan that can be adapted to the design and that 
satisfies the minimum size requirement provided by the user. If no plan exists that satisfies the 
minimum size requirement, pick the largest plan that can be adapted. 


Adapting the Prepared Plans 


Generating Multiple Factors from One Column 


A four-level factor can be transformed into three two-level factors using the rule in the following table.

Table 73-1 

Converting a four-level factor to three two-level factors 
Original Code A B C 
0 0 0 0 
1 0 1 1 
2 1 0 1 
3 1 1 0 
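The rule in Table 73-1 can be applied mechanically. The sketch below is illustrative only (the names `FOUR_TO_TWO`, `expand_four_level`, and `pairwise_balanced` are hypothetical); it expands a four-level column and checks that every pair of the resulting two-level columns contains each of the four level combinations equally often. In the table, column C equals A XOR B.

```python
from collections import Counter
from itertools import combinations

# Table 73-1: original code -> (A, B, C)
FOUR_TO_TWO = {0: (0, 0, 0), 1: (0, 1, 1), 2: (1, 0, 1), 3: (1, 1, 0)}

def expand_four_level(column):
    """Replace each four-level code by its three two-level codes."""
    return [FOUR_TO_TWO[code] for code in column]

def pairwise_balanced(rows):
    """True if every pair of two-level columns shows all 4 combinations equally often."""
    for i, j in combinations(range(len(rows[0])), 2):
        counts = Counter((row[i], row[j]) for row in rows)
        if len(counts) != 4 or len(set(counts.values())) != 1:
            return False
    return True

expanded = expand_four_level([0, 1, 2, 3, 0, 1, 2, 3])
```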


An eight-level factor can be transformed into seven two-level factors using the rule in the 
following table. 


Table 73-2 

Converting an eight-level factor to seven two-level factors 
Original Code A B C D E F G
0 0 0 0 0 0 0 0 
1 1 0 0 1 1 0 1 
2 0 1 0 1 0 1 1 
3 1 1 0 0 1 1 0 
4 0 0 1 0 1 1 1 
5 1 0 1 1 0 1 0 
6 0 1 1 1 1 0 0 
7 1 1 1 0 0 0 1 


A nine-level factor can be transformed into four three-level factors using the rule in the following table.

Table 73-3
Converting a nine-level factor to four three-level factors
Original Code A B C D
0 0 0 0 0


Changing the Number of Levels in a Column


Any factor of m levels can be transformed into a factor of n<m levels by many-to-one mappings 
without changing its orthogonality. Any mapping can be used; i mod n is used here. 


Library of Prepared Plans 


This section describes previously developed plans. 


Plackett-Burman Plans 


Plackett and Burman (1946) describe a series of plans that can be generated from a single column 
by rotation. The general algorithm for generating any of these plans is: 


■ Let L be the number of levels for which the plan is designed. No factor in the specific design can have more than L levels.


■ Let N be the number of rows (combinations) finally to be generated. Note that N=F+1 where F is defined below.


■ Starting with a given column of N-1 level codes, rotate one position to generate each new column.


■ Finally, add a row of zeroes.


$(N-1)/(L-1)$ orthogonal columns can be generated in this fashion.


The Plackett-Burman plans used here are designated PBL.F, where L is the maximum number of 
levels and F is the number of factors: 


Label Generating Column 

PB 2.7 11101 00 

PB 2.11 11011 10001 0 

PB 2.15 11110 10110 01000 

PB 2.19 11001 11101 01000 0110 

PB 2.23 11111 01011 00110 01010 000

PB 2.31 00001 01011 10110 00111 11001 10100 1 

PB 2.35 01011 10001 11110 11100 10000 10101 10010

PB 2.43 11001 01001 11011 11100 01011 10000 01000 11010 110

PB 2.47 11111 01111 00101 01110 01001 10110 00101 01100 00100 00 

PB 2.59 11011 10101 00100 11101 11100 11111 00000 11000 01000 11011 01010 
0010 

PB 3.4 01220 211 

PB 3.13 00101 21120 11100 20212 21022 2 

PB 3.40 01111 20121 12120 20221 10201 10012 22021 00200 02222 10212 21210 
10112 20102 20021 11012 00100 

PB 5.6 04112 10322 42014 43402 3313 

PB 7.8 01262 21605 32335 20413 11430 65155 61024 54425 03646 634 
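For the two-level plans, the rotation scheme can be sketched as below. This is an illustration restricted to L = 2, where all N-1 rotated columns are used; the function names `plackett_burman` and `is_orthogonal` are hypothetical, and the direction of rotation is an arbitrary choice of the sketch.

```python
from collections import Counter
from itertools import combinations

def plackett_burman(generator):
    """Build a two-level plan from a generating column given as a digit string."""
    g = [int(ch) for ch in generator if ch.isdigit()]
    n = len(g)                                   # N - 1 level codes
    # Column c is the generating column rotated down by c positions.
    rows = [[g[(r - c) % n] for c in range(n)] for r in range(n)]
    rows.append([0] * n)                         # finally, add a row of zeroes
    return rows

def is_orthogonal(rows):
    """True if every pair of columns shows each of the 4 level pairs equally often."""
    for i, j in combinations(range(len(rows[0])), 2):
        counts = Counter((row[i], row[j]) for row in rows)
        if len(counts) != 4 or len(set(counts.values())) != 1:
            return False
    return True

pb_2_7 = plackett_burman("11101 00")        # PB 2.7: 8 rows, 7 columns
pb_2_11 = plackett_burman("11011 10001 0")  # PB 2.11: 12 rows, 11 columns
```

Spaces in the generating column are ignored, so the strings can be pasted from the table as printed.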


Addelman Plans 


Addelman (1961) described general methods for generating orthogonal main effects plans. That 
paper included a number of such designs, and using those methods, the authors generated more. 


Table 73-4 
18 rows, 7 columns of 3 levels each 
0000000 0021011 
0112111 0100122 
0221222 0212200 
1011120 1002221 
1120201 1111002 
1202012 1220110 
2022102 2010212 
2101210 2122020 
2210021 2201101 
Table 73-5 
8 rows, 1 column of 4 levels plus 4 columns of 2 levels 
0 0000 
0 1111 
1 0011 
1 1100 
2 0101 
2 1010 
3 0110 
3 1001 
Table 73-6 
16 rows, 5 columns of 4 levels each 
00000 02231
10111 12320
20222 22013
30333 32102
01123 03312
11032 13203
21301 23130
31210 33021
Table 73-7
32 rows, 9 columns of 4 levels each
000000000 002130213
011231111 013301302
022312222 020222031
033123333 031013120
101111032 103021221
110320123 112210330
123203210 121333003
132032301 130102112
202223102 200313311
213012013 211122200
220131320 222001133
231300231 233230022
303332130 301202323
312103021 310033232
321020312 323110101
330211203 332321010
Table 73-8
64 rows, 21 columns of 4 levels each
000000000000000000000 000222233331111022220
111111111111111100000 111333322220000122220
222222222222222200000 222000011113333222220
333333333333333300000 333111100002222322220
123012301230123012301 123230132101032030121
032103210321032112301 032321023010123130121
301230123012301212301 301012310323210230121
210321032103210312301 210103201232301330121
231023102310231023102 231201331021320001322
320132013201320123102 320310220130231101322
013201320132013223102 013023113203102201322
102310231023102323102 102132002312013301322
312031203120312031203 312213030211203013023
203120312031203131203 203302121300312113023
130213021302130231203 130031212033021213023
021302130213021331203 021120303122130313023
000111122223333011110 000333311112222033330
111000033332222111110 111222200003333133330
222333300001111211110 222111133330000233330
333222211110000311110 333000022221111333330
123103223013210003211 123321010322301021031
032012332102301103211 032230101233210121031
301321001231032203211 301103232100123221031
210230110320123303211 210012323011032321031
231132020133102032012 231310213202013010232
320023131022013132012 320201302313102110232
013310202311320232012 013132031020231210232
102201313200231332012 102023120131320310232
312120321303021020313 312302112032130002133
203031230212130120313 203213003123021102133
130302103121203220313 130120330210312202133
021213012030312320313 021031221301203302133
Table 73-9
16 rows, 1 column of 8 levels plus 8 columns of 2 levels 
0 00000000 0 11111111 
1 01010101 1 10101010 
2 00001111 2 11110000 
3 01011010 3 10100101 
4 00111100 4 11000011 
5 01101001 5 10010110 
6 00110011 6 11001100 
7 01100110 7 10011001 
Table 73-10 
31 rows, 1 column of 8 levels plus 8 columns of 4 levels 
0 00000000 0 22222222 
1 01230123 1 23012301 
2 02021313 2 20203131 
3 03211230 3 21033012 
4 00113322 4 22331100 
5 01323201 5 23101023 
6 02132031 6 20310213 
7 03302112 7 21120330 
0 11111111 0 33333333 
1 10321032 1 32103210 
2 13130202 2 31312020 
3 12300321 3 30122103 
4 11002233 4 33220011 
5 10232310 5 32010132 
6 13023120 6 31201302 
7 12213003 7 30031221 
Table 73-11 
64 rows, 9 columns of 8 levels each 
000000000 202222222 404444444 606666666
011234567 213016745 415670123 617452301
022465713 220647531 426021357 624203175
033651274 231473056 437215630 635037412
044517326 246735104 440153762 642371540
055723641 257501463 451367205 653145027
066172435 264350617 462536071 660714253
077346152 275164370 473702516 671520734
101111111 303333333 505555555 707777777
110325476 312107654 514761032 716543210
123574602 321756420 527130246 725312064
132740365 330562147 536304721 734126503
145406237 347624015 541042673 743260451
154632750 356410572 550276314 752054136
167063524 365241706 563427160 761605342
176257043 374075261 572613407 770431625
Table 73-12
27 rows, 1 column of 9 levels plus 9 columns of 3 levels
0 000000000 3 011001111 6 022002222
0 112121212 3 120122020 6 101120101
0 221212121 3 202210202 6 210211010
1 000111122 4 011112200 7 022110011
1 112202001 4 120200112 7 101201220
1 221020210 4 202021021 7 210022102
2 000222211 5 011220022 8 022221100
2 112010120 5 120011201 8 101012012
2 221101002 5 202102110 8 210100221
Table 73-13
81 rows, 10 columns of 9 levels each
0000000000 0336258147 0663174285
1011111111 1347036258 1674285063
2022222222 2358147036 2685063174
3033333333 3360582471 3606417528
4044444444 4371360582 4617528306
5055555555 5382471360 5628306417
6066666666 6303825714 6630741852
7077777777 7314603825 7641852630
8088888888 8325714603 8652630741
0112345678 0448561723 0775426831
1120453786 1456372804 1783507642
2101534867 2437480615 2764318750
3145678012 3472804156 3718750264
4153786120 4480615237 4726831075
5134867201 5461723048 5707642183
6178012345 6415237480 6742183507
7186120453 7423048561 7750264318
8167201534 8404156372 8731075426
0221687354 0557813462 0884732516
1202768435 1538624570 1865840327
2210876543 2546705381 2873651408
3254021687 3581246705 3827165840
4235102768 4562057813 4808273651
5243210876 5570138624 5816084732
6287354021 6524570138 6851408273
7268435102 7505381246 7832516084
8276543210 8513462057 8840327165


Decision Rules 


Each value of L (the maximum number of levels in the design) has a distinct decision rule. In their 
descriptions, the following notation is used: 


M    The user-supplied minimum number of rows desired in the plan
F    The number of factors in the design

L=2


If all factors have two levels, simply select the smallest two-level Plackett-Burman plan for which
$N_{plan} \ge \max(M, F+1)$
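The L=2 rule amounts to a lookup over the two-level plans in the library. The sketch below is illustrative: the names `PB2_FACTORS` and `select_two_level_plan` are hypothetical, the list of factor counts is taken from the generating-column table above, and returning `None` when no plan is large enough is an assumption of the sketch rather than documented behavior.

```python
# Factor counts F of the two-level Plackett-Burman plans PB 2.F in the library.
PB2_FACTORS = [7, 11, 15, 19, 23, 31, 35, 43, 47, 59]

def select_two_level_plan(min_rows, n_factors):
    """Smallest PB 2.F plan whose row count N = F + 1 is at least max(M, F + 1)."""
    needed = max(min_rows, n_factors + 1)
    for f in PB2_FACTORS:
        if f + 1 >= needed:
            return f"PB 2.{f}"
    return None  # assumption: no library plan is large enough

plan_a = select_two_level_plan(min_rows=8, n_factors=5)
plan_b = select_two_level_plan(min_rows=4, n_factors=12)
```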


L=3 
Let P = the number of factors with more than two levels, and let K=F+2P. 


If M<9 and F<6 and P<2, base the plan on Table 73-5 “8 rows, 1 column of 4 levels plus 4 
columns of 2 levels”. 


If M<10 and F<5, base the plan on PB 3.4. 

Otherwise, if M<17 and K<16, base it on Table 73-6 “16 rows, 5 columns of 4 levels each”. 
Otherwise, if M<19 and K<8, base it on Table 73-4 “18 rows, 7 columns of 3 levels each”. 
Otherwise, if M<28 and K<14, base it on PB 3.13. 

Otherwise, if M<65 and K<22, use the rules for L=4. 

Otherwise, if F<41, base the plan on PB 3.40.


If F>40, there are too many factors. 


L=4 


Let P = the number of factors with more than two levels, and let K=F+2P.

If M<9 and F<6 and P<2, base the plan on Table 73-5 “8 rows, 1 column of 4 levels plus 4 columns of 2 levels”.


Otherwise, if M<17 and K<15, base it on Table 73-6 “16 rows, 5 columns of 4 levels each”. 
Otherwise, if M<26 and K<19, base it on PB 5.6. 

Otherwise, if M<33 and K<28, base it on Table 73-7 “32 rows, 9 columns of 4 levels each”. 
Otherwise, if M<49 and K<23, use the rules for L=7. 

Otherwise, if K<64, base the plan on Table 73-8 “64 rows, 21 columns of 4 levels each”. 
Otherwise, there are too many factors. 


A four-level factor can be transformed into three two-level factors using the rule in Table 
73-1 “Converting a four-level factor to three two-level factors”. 


L=5 

Create a plan based on the L=7 rules. 

If that plan has 26 or more rows and M<26 and F<7, base the plan on PB 5.6.
Otherwise, use the plan generated in step 1. 

L=6 

Treat this case as L=7. 

L=7 

Generate the best plan based on L=8. 


If that plan has more than 49 rows and M<50 and F<9, base the plan on PB 7.8. 


Otherwise, use the plan generated in step 1. 


L=8 


Let P be the number of factors with more than two levels, and Q be the number of factors with 
more than four levels. 


If M<17 and F<10 and P<2, then base the plan on Table 73-9 “16 rows, 1 column of 8 levels 
plus 8 columns of 2 levels”. 


Otherwise, if M<28 and F<11 and only one factor has more than three levels, base the plan on the L=9 rules.

Otherwise, if M<33 and Q<2 and F+2P+4Q<32, base the plan on Table 73-10 “31 rows, 1 column of 8 levels plus 8 columns of 4 levels”.


Otherwise, if M<65 and F+6P<64, base it on Table 73-11 “64 rows, 9 columns of 8 levels each”. 


Otherwise, base the plan on the L=9 rules. 


An eight-level factor can be transformed into seven two-level factors using the rule in Table 73-2 “Converting an eight-level factor to seven two-level factors”.


L=9 
Let P be the number of factors with more than three levels, and K=F+3P. 


If M<28 and F<11 and P<2, then base the plan on Table 73-12 “27 rows, 1 column of 9 levels 
plus 9 columns of 3 levels”. 


Otherwise, if K<41, base it on Table 73-13 “81 rows, 10 columns of 9 levels each”. 
Otherwise, there are too many factors. 


A nine-level factor can be transformed into four three-level factors using the rule in Table 
73-3 “Converting a nine-level factor to four three-level factors”. 


Randomization 


After a basic plan has been selected, columns are selected at random (if possible) to fit the given design. If the basic plan is asymmetric (that is, one column has more levels than the others), then the factor in the plan with many levels must be assigned to the factor in the design with many levels, and the remaining plan factors must be assigned randomly to the remaining design factors.

If factors are to be transformed into multiple factors (for example, eight-level factors transformed into two-level factors), you can randomly assign columns from the plan to design factors with many levels first, then transform the remaining columns, and then select at random from the transformed columns the columns needed.


OVERALS Algorithms


The OVERALS algorithm was first described in Gifi (1981) and Van der Burg, De Leeuw and 
Verdegaal (1984); also see Verdegaal (1986), Van de Geer (1987), Van der Burg, De Leeuw and
Verdegaal (1988), and Van der Burg (1988). Characteristic features of OVERALS, conceived by 
De Leeuw (1973), are the partitioning of the variables into K sets and the ability to specify any 
of a number of measurement levels for each variable separately. Analogously to the situation in 
multiple regression and canonical correlation analysis, OVERALS focuses on the relationships 
between sets; any particular variable contributes to the results only inasmuch as it provides 
information that is independent of the other variables in the same set. 


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


n    Number of cases (objects)
m    Number of variables
p    Number of dimensions
K    Number of sets

For variable j; j = 1, ..., m

$k_j$    Number of valid categories (distinct values) of variable j
$G_j$    Indicator matrix for variable j, of order $n \times k_j$, with elements
         $g_{(j)ir} = 1$ when the ith object is in the rth category of variable j
         $g_{(j)ir} = 0$ when the ith object is not in the rth category of variable j
$D_j$    Diagonal matrix, containing the univariate marginals; that is, the column sums of $G_j$

For set k; k = 1, ..., K

J(k)    Index set of the variables that belong to set k (so that you can write $j \in J(k)$)
$m_k$    Number of variables in set k (number of elements in J(k))
$M_k$    Binary diagonal $n \times n$ matrix, with diagonal elements defined as
         $m_{(k)ii} = 1$ when the ith observation is within the range $[1, k_j]$ for all $j \in J(k)$
         $m_{(k)ii} = 0$ when the ith observation is outside the range $[1, k_j]$ for any $j \in J(k)$


The quantification matrices and parameter vectors are:


X    Object scores, of order $n \times p$

$X_j$    Auxiliary matrix of order $n \times p$, with corrected object scores when fitting variable j

$Y_j$    Category quantifications for multiple variables, of order $k_j \times p$

$y_j$    Category quantifications for single variables, of order $k_j$

$a_j$    Variable weights for single variables, of order p

$Q_k$    Quantified variables of the kth set, of order $n \times m_k$, with columns $q_j = G_j y_j$

Y    Collection of multiple and single category quantifications across variables and sets.


Note: The matrices $M_k$, $G_j$, $M_*$, and $D_j$ are exclusively notational devices; they are stored in reduced form, and the program fully profits from their sparseness by replacing matrix multiplications with selective accumulation.
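To make the notation concrete, the following sketch builds the indicator matrix, the marginals, and the diagonal of the set mask for small inputs. It is dense and purely illustrative (as the note says, the program stores these matrices in reduced form); the function names are hypothetical, and out-of-range values are represented here by the integer 0.

```python
def indicator_matrix(values, k):
    """G_j: one row per object, one column per category 1..k."""
    return [[1 if v == r else 0 for r in range(1, k + 1)] for v in values]

def marginals(G):
    """Diagonal of D_j: the column sums of G_j."""
    return [sum(col) for col in zip(*G)]

def set_mask(columns, ks):
    """Diagonal of M_k: 1 only if every variable in the set is within [1, k_j]."""
    return [int(all(1 <= v <= k for v, k in zip(row, ks)))
            for row in zip(*columns)]

var_a = [1, 3, 2, 0]           # k_j = 3; the last object is out of range
var_b = [2, 2, 1, 1]           # k_j = 2
G_a = indicator_matrix(var_a, 3)
d_a = marginals(G_a)
mask = set_mask([var_a, var_b], [3, 2])
```

An object with any out-of-range value in a set receives a 0 on the diagonal of $M_k$, which is how all of its values in that set drop out of the loss.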


Objective Function Optimization 


The OVERALS objective is to find object scores X and a set of $\underline{Y}_j$ (for j = 1, ..., m; the underlining indicates that they may be restricted in various ways) so that the function

$$\sigma(X; \underline{Y}) = \frac{1}{K} \sum_k \mathrm{tr}\left[ \Bigl(X - \sum_{j \in J(k)} G_j \underline{Y}_j\Bigr)' M_k \Bigl(X - \sum_{j \in J(k)} G_j \underline{Y}_j\Bigr) \right]$$

is minimal, under the normalization restriction $X' M_* X = KnI$, where $M_* = \sum_k M_k$ and I is the $p \times p$ identity matrix. The inclusion of $M_k$ in $\sigma(X; \underline{Y})$ provides the following mechanism for weighting the loss: whenever any of the data values for object i in set k falls outside its particular range $[1, k_j]$, a circumstance that may indicate either genuine missing values or simulated missing values for the sake of analysis, all other data values for object i in set k are disregarded (listwise deletion per set). The diagonal of $M_*$ contains the number of “active” sets for each object. The object scores are also centered; that is, they satisfy $u' M_* X = 0$ with u denoting an n-vector with ones.


Optimal Scaling Levels 


The following optimal scaling levels are distinguished in OVERALS: 

Multiple Nominal. $\underline{Y}_j = Y_j$ (equality restriction only).

(Single) Nominal. $\underline{Y}_j = y_j a_j'$ (equality and rank-one restrictions).

(Single) Ordinal. $\underline{Y}_j = y_j a_j'$ and $y_j \in C_j$ (equality, rank-one, and monotonicity restrictions). The monotonicity restriction $y_j \in C_j$ means that $y_j$ must be located in the convex cone of all $k_j$-vectors with nondecreasing elements.

(Single) Numerical. $\underline{Y}_j = y_j a_j'$ and $y_j \in L_j$ (equality, rank-one, and linearity restrictions). The linearity restriction $y_j \in L_j$ means that $y_j$ must be located in the subspace of all $k_j$-vectors that are a linear transformation of the vector consisting of $k_j$ successive integers.

For each variable, these levels can be chosen independently. The general requirement for all options is that equal category indicators receive equal quantifications. The general requirement for the non-multiple options is $\underline{Y}_j = y_j a_j'$; that is, $\underline{Y}_j$ is of rank one; for identification purposes, $y_j$ is always normalized so that $y_j' D_j y_j = n$.


Optimization 
Optimization is achieved by executing the following iteration scheme:
1. Initialization I or II
2. Loop across sets and variables
3. Eliminate contributions of other variables
4. Update category quantifications
5. Update object scores
6. Orthonormalization
7. Convergence test: repeat (2) through (6) or continue
8. Rotation


Steps (1) through (8) are explained below.


Initialization 
I. Random 


The object scores X are initialized with random numbers. Then X is normalized so that $u' M_* X = 0$ and $X' M_* X = KnI$, yielding the normalized X. For multiple variables, the initial category quantifications are set equal to 0. For single variables, the initial category quantifications $y_j$ are defined as the first $k_j$ successive integers normalized in such a way that $u' D_j y_j = 0$ and $y_j' D_j y_j = n$, and the initial variable weights are set equal to 0.


II. Nested


In this case, the above iteration scheme is executed twice. In the first cycle (initialized with initialization I), all single variables are temporarily treated as single numerical, so that for the second, proper cycle, all relevant quantities can be copied from the results of the first one.


Loop across sets and variables 


The next two steps are repeated for k = 1, ..., K and all $j \in J(k)$. During the updating of variable j, all parameters of the remaining variables are fixed at their current values.


Eliminate contributions of other variables 


For quantifying variable j in set k, define the auxiliary matrix

$$V_{kj} = \sum_{l \in J(k)} G_l \underline{Y}_l - G_j \underline{Y}_j$$

which accumulates the contributions of the other variables in set k; then in $(X - V_{kj})$ the contributions of the other variables are eliminated from the object scores. This device enables you to write the loss $\sigma(X; \underline{Y})$ as a function of X and $\underline{Y}_j$ only:

$$\sigma(X; \underline{Y}_j) = \text{constant} + \frac{1}{K}\,\mathrm{tr}\left[ \bigl((X - V_{kj}) - G_j \underline{Y}_j\bigr)' M_k \bigl((X - V_{kj}) - G_j \underline{Y}_j\bigr) \right]$$

With fixed current values X the unconstrained minimum over $\underline{Y}_j$ is attained for the matrix

$$\tilde{Y}_j = (G_j' M_k G_j)^{-1} G_j' M_k (X - V_{kj})$$

which forms the basis of the further computations. When switching to another variable l in the same set, the matrix $V_{kl}$ is not computed from scratch, but updated:

$$V_{kl} = V_{kj} + G_j \underline{Y}_j - G_l \underline{Y}_l$$

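Because each row of an indicator matrix contains a single 1 and $M_k$ is diagonal, $G_j' M_k G_j$ is diagonal, so the unconstrained minimizer reduces to category-wise means of the active rows of $X - V_{kj}$. The sketch below illustrates that reading; the name `category_means` and its argument layout are hypothetical, and leaving a category with no active objects at zero is an assumption of the sketch.

```python
def category_means(categories, active, Z, k):
    """Category-wise means of the rows of Z = X - V_kj over active objects.

    categories: category (1..k) of each object on variable j
    active:     diagonal of M_k (1 = object is active in set k)
    Z:          n x p matrix given as a list of rows
    """
    p = len(Z[0])
    sums = [[0.0] * p for _ in range(k)]
    counts = [0] * k
    for cat, is_active, row in zip(categories, active, Z):
        if is_active and 1 <= cat <= k:
            counts[cat - 1] += 1
            for s in range(p):
                sums[cat - 1][s] += row[s]
    # Assumption: a category with no active objects keeps a zero row.
    return [[x / c for x in row] if c else row
            for row, c in zip(sums, counts)]

Y_tilde = category_means(
    categories=[1, 1, 2, 2],
    active=[1, 1, 1, 0],
    Z=[[1.0], [3.0], [5.0], [100.0]],
    k=2,
)
```

This is an instance of the selective accumulation mentioned in the notation note: no matrix product is formed.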
Update category quantifications 


For multiple nominal variables, the new category quantifications are simply

$$Y_j^+ = \tilde{Y}_j$$

For single variables one cycle of an ALS algorithm (De Leeuw et al., 1976) is executed for computing the rank-one decomposition of $\tilde{Y}_j$, with restrictions on the left-hand vector. This cycle starts from the previous category quantification $\tilde{y}_j$ with

$$a_j^+ = \tilde{Y}_j' D_j \tilde{y}_j$$

When the current variable is numerical, we are ready; otherwise we compute

$$y_j^* = \tilde{Y}_j a_j^+$$

Now, when the current variable is single nominal, you can simply obtain $y_j^+$ by normalizing $y_j^*$ in the way indicated below; otherwise the variable must be ordinal, and you have to insert the weighted monotonic regression process

$$y_j^* \leftarrow \mathrm{WMON}(y_j^*)$$

The notation WMON( ) is used to denote the weighted monotonic regression process, which makes $y_j^*$ monotonically increasing. The weights used are the diagonal elements of $D_j$ and the subalgorithm used is the up-and-down-blocks minimum violators algorithm (Kruskal, 1964; Barlow et al., 1972). The result is normalized:

$$y_j^+ = n^{1/2}\, y_j^* \left( y_j^{*\prime} D_j\, y_j^* \right)^{-1/2}$$

Finally, we set $Y_j^+ = y_j^+ a_j^{+\prime}$.
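The ordinal step can be illustrated with a short sketch. Note that it uses the pool-adjacent-violators scheme rather than the up-and-down-blocks algorithm named above; since the weighted least-squares monotone fit is unique, both should arrive at the same result. The function names are hypothetical, and the weights (the diagonal of $D_j$) are assumed to be positive.

```python
def weighted_monotone_regression(y, w):
    """Weighted least-squares nondecreasing fit (pool adjacent violators)."""
    blocks = []  # each block: [weighted mean, total weight, number of elements]
    for value, weight in zip(y, w):
        blocks.append([value, weight, 1])
        # Merge backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, c1 + c2])
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted

def normalize(y, d, n):
    """Rescale y so that y' D y = n, with d the diagonal of D."""
    scale = (n / sum(di * yi * yi for yi, di in zip(y, d))) ** 0.5
    return [scale * yi for yi in y]

fit = weighted_monotone_regression([1.0, 3.0, 2.0, 4.0], [1, 1, 3, 1])
y_plus = normalize(fit, [1, 1, 3, 1], n=6)
```

In the example the second and third categories violate the ordering and are pooled to their weighted mean, (3·1 + 2·3)/4 = 2.25.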
Update object scores

First the auxiliary score matrix W is computed as

$$W \leftarrow \sum_k M_k \sum_{j \in J(k)} G_j \underline{Y}_j$$

and centered with respect to $M_*$:

$$X^* = \{I - M_* u u' / u' M_* u\}\, W$$

These two steps yield the locally best updates in the absence of orthogonality constraints.


Orthonormalization 


The problem is to find an $M_*$-orthonormal $X^+$ that is closest to $M_*^{-1} X^*$ in the $M_*$-weighted least squares sense. In OVERALS, this is done by setting

$$X^+ \leftarrow m^{1/2} M_*^{-1/2}\, \mathrm{PROCRU}(M_*^{-1/2} X^*)$$

The notation PROCRU( ) is used to denote the Procrustes orthonormalization process. If the singular value decomposition of the input matrix $M_*^{-1/2} X^*$ is denoted by $K \Lambda L'$, with $K'K = I$, $L'L = I$, and $\Lambda$ diagonal, then the output matrix $KL' = M_*^{-1/2} X^* L \Lambda^{-1} L'$ satisfies orthonormality in the metric $M_*$. The calculation of L and $\Lambda$ is based on tridiagonalization with Householder transformations followed by the implicit QL algorithm (Wilkinson, 1965).


Convergence test 


The difference between consecutive values of the loss is compared with the user-specified convergence criterion $\varepsilon$, a small positive number. After convergence, the badness-of-fit value is also given. Steps (2) through (6) are repeated as long as the loss difference exceeds $\varepsilon$.


Rotation 


The OVERALS loss function $\sigma(X; Y)$ is invariant under simultaneous rotations of X and Y. It can be shown that the solution is related to the principal axes of the average projection operator

$$Q_* = \frac{1}{K} \sum_k M_k Q_k (Q_k' M_k Q_k)^{-1} Q_k' M_k$$

In order to achieve principal axes orientation, which is useful for purposes of interpretation and comparison, it is sufficient to find a rotation matrix that makes the cross-products of the matrix $M_*^{-1/2} X^*$ diagonal; this matrix is identical to the one used in the Procrustes orthonormalization in step (6). In the terminology of that section, we rotate the matrices $X^+$, $Y^+$, and the vectors $a_j$ with the matrix L. The rotation matrix L is taken from the last PROCRU operation as described in step (6).


Diagnostics 


The following diagnostics are available.


Maximum Rank


The maximum rank $p_{\max}$ indicates the maximum number of dimensions that can be computed for any dataset (if exceeded, OVERALS adjusts the number of dimensions if possible and issues a message). In general,

$$p_{\max} = \begin{cases} \min\{(n-1),\, r_1,\, r_2\} & \text{if } K = 2 \\ \min\{(n-1),\, \max_k r_k\} & \text{if } K > 2 \end{cases}$$

where the quantities $r_k$ are defined as

$$r_k = \sum_{j \in JM(k)} k_j + m_{k2} - m_{k1}$$

Here $m_{k1}$ is the number of multiple variables with no missing values in set k, $m_{k2}$ is the number of single variables in set k, and JM(k) is an index set recording which variables are multiple in set k. Furthermore, OVERALS stops when any one of the following conditions is not satisfied:


1. $r_k < n_k - 1$


2. $n_k > 2$


3. » Th Ss Une (Me 7 l= (Mmax = 1) 
k 


Here n;. denotes the number of nonmissing objects in set k, and m,,.; denotes the maximum 
across all of 7;,. 


Marginal Frequencies 
The frequencies table gives the univariate marginals and the number of missing values (that is, 
values that are regarded as out of range for the current analysis) for each variable. These are 
computed as the column sums of $D_j$ and the total sum of $M_k$ for $j \in J(k)$. 


Fit and Loss Measures 


In the Summary of Analysis, loss and fit measures are reported. 


Loss Per Set 


This is K times $\sigma(X;Y)$, partitioned with respect to sets and dimensions; the means per dimension 
are also given. 


Eigenvalue 


The values listed here are 1 minus the means per dimension defined above, forming a partitioning 
of FIT, which is $p - \sigma(X;Y)$ when convergence is reached. These quantities are the eigenvalues 
of $Q_*$ defined in section (8). 
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Multiple Fit 


This measure is computed as the diagonal of the matrix $Y_j' D_j Y_j$, computed for all variables 
(rows) with dimensions given in the columns. 


Single Fit 


This table gives the squared weights, computed only for variables that are single. The sum of 
squares of the weights: $a_j' a_j$. 


Single Loss 


Single loss is equal to multiple fit minus single fit for single variables only. It is the loss incurred 
by the imposition of the rank-one measurement level restrictions. 


Component Loadings (for Single Variables) 


Loadings are the lengths of the projections of the quantified (single) variables onto the object 
space: $q_j' X$. When there are no missing data, the loadings are equal to the correlations between 
the quantified variables and the object scores (the principal components). 


Category Quantifications 
Single Coordinates. For single variables only: $Y_j = y_j a_j'$ 


Multiple Coordinates. These are x; defined previously; that is, the unconstrained minimizers of 
the loss function, for multiple variables equal to the category quantifications. 


Category Centroids 


The centroids of all objects that share the same category, $D_j^{-1} G_j' X$. Note that they are not 
necessarily equal to the multiple coordinates. 


Projected Category Centroids 


For single variables only, $y_j b_j'$. These are the points on a line in the direction given by the 
loadings $b_j$ that result from projection of the category centroids with weights $D_j$. 
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PARTIAL CORR Algorithms 


PARTIAL CORR produces partial correlation coefficients that describe the relationship between 
two variables while adjusting for the effects of one or more additional variables. 


Notation 


The following notation is used throughout this section unless otherwise stated: 


Table 75-1 
Notation 

Notation    Description 
N           Number of cases 
$X_{kl}$    Value of variable k for case l 
$w_l$       Weight for case l 
$W_{ij}$    Sum of the weights of cases used in computation of statistics for variables i and j 
$W_i$       Sum of the weights of cases used in computation of statistics for variable i 

Statistics 


Zero-Order Correlations 


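$$r_{ij} = \frac{\sum_{l=1}^{N} w_l X_{il} X_{jl} - \left(\sum_{l=1}^{N} w_l X_{il}\right)\left(\sum_{l=1}^{N} w_l X_{jl}\right)/W_{ij}}{\sqrt{\left[\sum_{l=1}^{N} w_l X_{il}^2 - \left(\sum_{l=1}^{N} w_l X_{il}\right)^2/W_{ij}\right]\left[\sum_{l=1}^{N} w_l X_{jl}^2 - \left(\sum_{l=1}^{N} w_l X_{jl}\right)^2/W_{ij}\right]}}$$

The significance level for $r_{ij}$ is based on 

$$t = r_{ij}\sqrt{\frac{W_{ij}-2}{1-r_{ij}^2}}$$

which, under the null hypothesis, is distributed as a t with $W_{ij} - 2$ degrees of freedom. By default, 
one-tailed significance levels are printed. 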
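
The weighted sums-of-products form of the zero-order correlation can be sketched as follows. This is a minimal illustration, not the procedure's own code: the function names `weighted_corr` and `corr_t` are hypothetical, and the sketch assumes every case is nonmissing on both variables, so that $W_{ij}$ is simply the sum of all weights.

```python
import math

def weighted_corr(x, y, w):
    """Weighted Pearson correlation from weighted sums and cross-products."""
    W = sum(w)                                   # W_ij: sum of case weights
    sx = sum(wi * xi for wi, xi in zip(w, x))
    sy = sum(wi * yi for wi, yi in zip(w, y))
    sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    syy = sum(wi * yi * yi for wi, yi in zip(w, y))
    num = sxy - sx * sy / W
    den = math.sqrt((sxx - sx * sx / W) * (syy - sy * sy / W))
    if den == 0.0:
        return None                              # noncomputable coefficient
    return num / den

def corr_t(r, W):
    """t statistic with W - 2 degrees of freedom; None when |r| = 1."""
    if abs(r) >= 1.0:
        return None
    return r * math.sqrt((W - 2) / (1 - r * r))
```

An integer case weight behaves like replicating the case, which gives a quick sanity check on the weighted sums.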


Means and Standard Deviations 


$$\bar{X}_i = \sum_{l=1}^{N} w_l X_{il} / W_i$$

$$S_i = \sqrt{\left(\sum_{l=1}^{N} w_l X_{il}^2 - \bar{X}_i^2 W_i\right) \Big/ (W_i - 1)}$$

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If pairwise deletion is selected, means and standard deviations are based on all nonmissing cases. 
For listwise deletion, only cases with no missing values on any specified variables are included. 


Partial Correlations 


Partial correlations are calculated recursively from the lower-order coefficients using 

$$r_{ij.k} = \frac{r_{ij} - r_{ik} r_{jk}}{\sqrt{(1 - r_{ik}^2)(1 - r_{jk}^2)}} \quad \text{(first order)}$$

$$r_{ij.kl} = \frac{r_{ij.k} - r_{il.k} r_{jl.k}}{\sqrt{(1 - r_{il.k}^2)(1 - r_{jl.k}^2)}} \quad \text{(second order)}$$

and similarly for higher orders (Morrison, 1976, p. 94). 

If the denominator is less than $10^{-20}$, or if any of the lower-order coefficients necessary for 
calculations are system missing, the coefficient is set to system missing. If a coefficient in absolute 
value is greater than 1, it is set to system missing. (This may occur with pairwise deletion.) 
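
The recursion and its missing-value rules can be sketched as below. This is an illustrative sketch under stated assumptions: the function name `partial_corr`, the dictionary keyed by `frozenset` pairs, and the use of `None` for system missing are all hypothetical choices, and the $10^{-20}$ threshold is applied to the square-root denominator.

```python
import math

EPS = 1e-20  # denominator threshold quoted in the text

def partial_corr(r, i, j, controls):
    """Partial correlation of i and j given `controls`.

    `r` maps frozenset({a, b}) to the zero-order correlation of a and b;
    None stands in for system missing.
    """
    if not controls:
        return r.get(frozenset((i, j)))
    *rest, last = controls
    # Lower-order coefficients, all conditioned on `rest`
    r_ij = partial_corr(r, i, j, rest)
    r_il = partial_corr(r, i, last, rest)
    r_jl = partial_corr(r, j, last, rest)
    if r_ij is None or r_il is None or r_jl is None:
        return None
    denom = math.sqrt((1 - r_il ** 2) * (1 - r_jl ** 2))
    if denom < EPS:
        return None
    value = (r_ij - r_il * r_jl) / denom
    return None if abs(value) > 1 else value
```

For example, with all three zero-order correlations equal to 0.5, the first-order coefficient is 0.25/0.75 = 1/3.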
Significance Level 


The significance level is based on 

$$t = r\sqrt{\frac{df}{1 - r^2}}$$

The degrees of freedom are 

$$df = M - \theta - 2$$

where $\theta$ is the order of the coefficient and M is the minimum sum of weights from which the 
zero-order coefficients involved in the computations were calculated. Thus, for $r_{ij.kl}$, 

$$M = \min(W_{ij}, W_{ik}, W_{il}, W_{jk}, W_{jl}, W_{kl})$$

where $W_{ij}$ is the sum of weights of the cases used to calculate $r_{ij}$. If listwise deletion of missing 
values (default) was used, all $W_{ij}$ are equal. By default, the significance level is one-tailed. 
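
A short sketch of the degrees-of-freedom rule and the t statistic follows. The function name `partial_corr_t` and its argument layout are assumptions made for illustration; converting t to a significance level needs a Student's t distribution function, which is left out here.

```python
import math

def partial_corr_t(r, order, weight_sums):
    """Return (df, t) for a partial correlation r of the given order.

    `weight_sums` holds the W values of every zero-order coefficient
    that entered the computation; M is their minimum.
    """
    M = min(weight_sums)
    df = M - order - 2
    if df <= 0 or abs(r) >= 1:
        return df, None            # t is not computable
    return df, r * math.sqrt(df / (1 - r * r))
```

With r = 0.5, order 1, and all weight sums equal to 30, this gives df = 27 and t = 0.5 * sqrt(27 / 0.75) = 3.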
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Partial least squares (PLS) regression fits a model for one or more dependent variables based upon 
one or more predictors. It is especially useful when the predictors exhibit multicollinearity, or 
there are more predictors than cases. 


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


Table 76-1 
Notation 

Notation    Description 
X           N × n design matrix of independent variables, centered and perhaps standardized. 
            Note that there is no intercept term. 
Y           N × m matrix of dependent variables, centered and perhaps standardized 
c           m × 1 column vector of weights 
u           N × 1 column vector of Y scores 
w           n × 1 column vector of weights 
t           N × 1 column vector of X scores 
d           number of PLS factors to extract 
p           n × 1 loading vector 
q           m × 1 loading vector 
P           n × d loading matrix 
Q           m × d loading matrix 
T           N × d score matrix, $T = XW^*$ 
U           N × d score matrix 
W           n × d matrix of X-weights 
W*          n × d matrix of X-weights in original coordinates; these weights can be directly 
            applied to X, $W^* = W(P'W)^{-1}$ 
C           m × d matrix of Y-weights; these weights can be directly applied to Y. 
B           n × m matrix of regression parameters, $B = W^*C'$ 
E           N × n matrix of residuals, $E = X - TP'$ 
F           N × m matrix of residuals, $F = Y - UQ' = Y - XB$ 
DModX       N × 1 vector of distances of X variables to the model 
DModY       N × 1 vector of distances of Y variables to the model 
VIP         n × d matrix of Variable Importance in the Projection 


Preprocessing 



The following steps are performed before the estimation algorithm commences. 


Design Matrix 


The design matrix X is constructed from the independent variables as in GLM models without an 
intercept. 
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Categorical Variable Encoding 


The procedure temporarily recodes categorical dependent variables using one-of-c coding for the 
duration of the procedure. If there are c categories of a variable, then the variable is stored as 
c vectors, with the first category denoted (1,0,...,0), the next category (0,1,0,...,0), ..., and the 
final category (0,0,...,0,1). 


Categorical dependent variables are represented using dummy coding; that is, simply omit the 
indicator corresponding to the reference category. In particular, when there is a single dependent 
variable with exactly two levels, there will be a single indicator, and convergence will occur in 
a single NIPALS iteration. 


Missing Values 
Cases with user- or system-missing values are handled as follows: 


Listwise Deletion. Only cases with complete values for all X and Y variables will be used. 


Center and Standardize Variables 


Given a matrix of independent variables X and of dependent variables Y (with the design 
matrix, categorical variable encoding, and missing values), compute the mean and standard 
deviation of each variable, and replace X with the centered and standardized variates 
$X := (X - 1\mu_X')\Sigma_X^{-1}$, where $\Sigma_X$ is a diagonal matrix of standard deviations and $\mu_X$ is the vector 
of means; similarly for $Y := (Y - 1\mu_Y')\Sigma_Y^{-1}$. This change of coordinates must be reversed after all 
components have been extracted; $\hat{Y} := \hat{Y}\Sigma_Y + 1\mu_Y'$. 
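
The change of coordinates and its reversal can be sketched as below. This is a minimal sketch: the helper names `standardize` and `unstandardize` are hypothetical, variables are stored column by column, and the sample standard deviation (divisor N - 1) is an assumption, since the text does not state the divisor.

```python
from statistics import mean, stdev

def standardize(columns):
    """Center and scale each variable (one list of N values per variable)."""
    means = [mean(col) for col in columns]
    sds = [stdev(col) for col in columns]        # assumes N >= 2 and sd > 0
    scaled = [[(v - m) / s for v in col]
              for col, m, s in zip(columns, means, sds)]
    return scaled, means, sds

def unstandardize(scaled, means, sds):
    """Reverse the change of coordinates, e.g. for predicted values."""
    return [[v * s + m for v in col]
            for col, m, s in zip(scaled, means, sds)]
```

Keeping `means` and `sds` around is what allows the predictions and regression parameters to be mapped back to the original coordinates later.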


Estimation 


When there is only one dependent variable (m=1), use the NIPALS algorithm. Only one iteration 
will be required. When there is more than one dependent variable (m>1), solve the equivalent 
eigenproblem, solving for the vector with the smallest dimension. Use the resulting eigenvector 
as the input to NIPALS, checking the vector with the greatest length for convergence. (This 
check may turn out to be unneeded, in which case one iteration of NIPALS will still be needed 
to obtain all the required vectors.) 


This diagram illustrates the relationship between the vectors and matrices used in the NIPALS 
algorithm, where the vectors should be taken as determined only up to scalar multiples: 


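[Diagram: relationships among X, Y and the vectors w, t, c, u, p, q in the NIPALS algorithm; one recoverable label reads $q = Y'u/(u'u)$.] 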
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NonLinear Iterative Partial Least Squares (NIPALS) Algorithm 


The classical NIPALS algorithm explicitly takes c and w to have unit norm. In particular, note that 
if there is only one dependent variable Y, then c is a 1×1 unit vector so c = 1, and this will be the 
most useful starting point: initialize u = Y; otherwise, initialize u or any of the vectors to some 
random starting value. Also, when c = 1, NIPALS converges in only one iteration. 


The following loop may be entered at whichever point is most convenient; in particular, when 
m = 1 and c = 1, begin at step 1 with u = Y: 


Repeat until convergence: 


1. $w = X'u/(u'u)$ 
2. $w = w/\|w\|$ 
3. $t = Xw$ 
4. $c = Y't/(t't)$ 
5. $c = c/\|c\|$ 
6. $u = Yc$ 


Although the NIPALS algorithm will in practice be replaced with the solution of an eigenproblem 
(see “NIPALS-Equivalent Eigenproblem”) the relationships defined in the sequence above will be 
used to obtain all the matrices and vectors required. 


Regress X on t and Y on u: 
1. $p = X't/(t't)$ 
2. $q = Y'u/(u'u)$ 
Deflate X and Y matrices: 
1. $X = X - tp'$ 
2. $Y = Y - tc'$ (use c from step 4, not step 5, above) 


Note that the deflated matrices are the errors E, F at that stage. 
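
For the single-response case (m = 1), one pass through the loop followed by the regression and deflation steps can be sketched as below. This is an illustration only, written with plain Python lists; the function and helper names are hypothetical, X is a list of N rows, y is a list of N values, and both are assumed to be centered already.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def xt_vec(X, v):
    """Compute X'v for X stored as a list of rows."""
    return [dot([row[j] for row in X], v) for j in range(len(X[0]))]

def x_vec(X, w):
    """Compute Xw."""
    return [dot(row, w) for row in X]

def nipals_pls1_factor(X, y):
    """Extract one PLS factor when there is a single dependent variable."""
    u = list(y)                                   # start at u = Y (c = 1)
    w = [v / dot(u, u) for v in xt_vec(X, u)]     # step 1: w = X'u/(u'u)
    norm = math.sqrt(dot(w, w))
    w = [v / norm for v in w]                     # step 2: unit-length w
    t = x_vec(X, w)                               # step 3: t = Xw
    tt = dot(t, t)
    c = dot(y, t) / tt                            # step 4: unnormalized c
    p = [v / tt for v in xt_vec(X, t)]            # loading p = X't/(t't)
    X_def = [[x - ti * pj for x, pj in zip(row, p)]
             for row, ti in zip(X, t)]            # X - tp'
    y_def = [yi - ti * c for yi, ti in zip(y, t)] # Y - tc'
    return w, t, c, p, X_def, y_def
```

After deflation, both residual blocks are orthogonal to the score vector t, which is a convenient check on an implementation.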


Repeat d times, assembling the t, p, u, q vectors into matrices to obtain the desired factorizations 
into scores T, U, loadings P, Q, weights W, C, and residuals E, F: 

$$X = TP' + E$$
$$Y = UQ' + F$$

Since the matrices X, Y are centered, note that $(t't)^{-1}t'Y$ is the normal equation for a regression 
of Y on t; likewise $(u'u)^{-1}u'X$ regresses X on u. Thus the NIPALS algorithm alternates 
between regression and projection. If vectors are considered to be determined only up to length, 
there is no longer any distinction between the two. 
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The matrix of regression coefficients for predicting Y from X is given by either any of the 
following expressions, and is independent of the scalings of T and U: 


$$B = W^*C'$$
$$B = W(P'W)^{-1}C'$$
$$B = X'U(T'XX'U)^{-1}T'Y$$


W and C are obtained by assembling the w and c vectors into n × d and m × d matrices. This 
solves the PLS Regression equation: 

$$Y = XB + F$$

Until now the X and Y matrices have been assumed to be centered, and (optionally) standardized. 
The parameters B and residuals E and F must be restored to their original coordinates: 
$B^* = \Sigma_X^{-1} B \Sigma_Y$, $E^* = E\Sigma_X$, with the final regression equation in the original coordinates given 
by $Y = XB^* + (\mu_Y - \mu_X B^*)$. Also, the residuals F left over after deflating the Y matrix should 
not be used, but are recalculated from the predictions in the centered and rescaled coordinates as 
$F = Y - XB$; $F^* = F\Sigma_Y$ in the original coordinates. 


NIPALS-Equivalent Eigenproblem 


Regarding the vectors as determined only up to length allows the NIPALS loop to be replaced by 
an eigenproblem. One can choose to solve any of the following; typically selecting the matrix 
with the smallest dimension, which will often be the first equation: 


$$Y'XX'Yc = \lambda c$$
$$YY'XX'u = \lambda u$$
$$X'YY'Xw = \lambda w$$
$$XX'YY't = \lambda t$$


Once c (or any of the others) are determined, the rest of the vectors can be determined; at this 
point it is important to keep track of the lengths. 


The eigenproblem can be solved by the Power Method. 


Power Method 


$$x_{i+1} = Ax_i$$
$$\lambda_i = x_i'Ax_i = x_i'x_{i+1}$$
$$x_{i+1} = x_{i+1}/\|x_{i+1}\|$$


Initialize a vector $x_0$, say to the vector 1, normalize to unit length, then iterate until convergence. 
The sequence of iterates is guaranteed to converge to the eigenvector associated to the dominant 
(that is, the largest) eigenvalue. Moreover the dominant eigenvalue is guaranteed to be unique. 
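
The three update equations can be sketched in plain Python as below. This is a minimal sketch rather than the procedure's implementation: the function name `power_method`, the tolerance, and the iteration cap are assumptions, and the matrix is a small symmetric list of rows.

```python
import math

def power_method(A, tol=1e-12, max_iter=1000):
    """Dominant eigenvalue and eigenvector of a symmetric matrix A."""
    n = len(A)
    x = [1.0 / math.sqrt(n)] * n                 # vector of ones, unit length
    lam = None
    for _ in range(max_iter):
        y = [sum(a * b for a, b in zip(row, x)) for row in A]   # A x_i
        new_lam = sum(a * b for a, b in zip(x, y))              # x_i' A x_i
        norm = math.sqrt(sum(v * v for v in y))
        if norm == 0.0:
            return 0.0, x                        # x lies in the null space
        x = [v / norm for v in y]                # renormalize the iterate
        if lam is not None and abs(new_lam - lam) < tol:
            return new_lam, x
        lam = new_lam
    return lam, x
```

For the 2 × 2 matrix with rows (4, 1) and (1, 3), the dominant eigenvalue is (7 + sqrt(5))/2, which the iteration approaches from the all-ones start.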


Rather than continue to iterate using the power method, switch to Rayleigh Quotient Iteration 


(RQI). 
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Rayleigh Quotient Iteration 


Begin with initial estimates of $x_0$ and $\lambda_0$ obtained from one or two iterations of the Power 
Method. Then repeat until convergence: 

$$(A - \lambda_i I)w = x_i \quad \text{(solve for w)}$$
$$x_{i+1} = w/\|w\|$$
$$\lambda_{i+1} = x_{i+1}'Ax_{i+1}$$


The conjugate gradient method may be used to solve for w. 


The eigenproblem is considered solved when the difference between two iterations is small 
enough. However, the eigenproblem is typically solved for c, but the vector of interest is t. One 
iteration of NIPALS is used to obtain the vectors (c, u, w, t). 


Output Statistics 


The following output statistics are available. 


Proportion of Variance Explained 
The proportion of variance explained by the extraction of factor k is given by computing: 

$$SS_k(Y) = (t_{(k)}'t_{(k)}) \cdot \mathrm{trace}(c_{(k)}c_{(k)}') = (t_{(k)}'t_{(k)}) \cdot (c_{(k)}'c_{(k)})$$

$$VarProp_k(Y) = \frac{SS_k(Y)}{\mathrm{trace}(Y'Y)}$$

The cumulative proportion of variance explained is 

$$CumVarProp_k(Y) = \sum_{i=1}^{k} VarProp_i(Y)$$

Here $t_{(k)}$ and $c_{(k)}$ are the column vectors obtained after k factors have been extracted; that is, the 
kth columns of T and C. Note that $c_{(k)}$ is taken from step 4 of the NIPALS algorithm, and is not 
rescaled to unit length as in step 5. 


The proportion of variance explained in X is similar: 

$$SS_k(X) = (t_{(k)}'t_{(k)}) \cdot \mathrm{trace}(p_{(k)}p_{(k)}') = (t_{(k)}'t_{(k)}) \cdot (p_{(k)}'p_{(k)})$$

$$VarProp_k(X) = \frac{SS_k(X)}{\mathrm{trace}(X'X)}$$

$$CumVarProp_k(X) = \sum_{i=1}^{k} VarProp_i(X)$$


Variable Importance in the Projection (VIP) 


The VIP statistic is computed for each variable and latent factor as 

$$VIP_{jk} = \sqrt{\,n\,\frac{\sum_{i=1}^{k} SS_i(Y)\, w_{ji}^2}{\sum_{i=1}^{k} SS_i(Y)}}$$

Here $1 \le j \le n$ and $1 \le k \le d$; $w_{jk}$ is the jth element of $w_{(k)}$, where $w_{(k)}$ is the kth column of W. 


Distance to the Model 


Distance to the model, sometimes denoted DModX and DModY, is given by: 


$$DModX_i = \sqrt{e_i e_i'}$$
$$DModY_i = \sqrt{f_i f_i'}$$


for each row e; of E and f; of F. This may be normalized to: 


DModxX; = 
DMody; = 


PRESS Statistic 


The PRESS residuals are $\sqrt{f_i f_i'}$, that is, $DModY_i$ before any normalizations are carried out. The 
PRESS statistic is simply 

$$PRESS = \sum_{i=1}^{N} DModY_i^2$$

“Jackknifed”, or more correctly, leave-one-out PRESS residuals are calculated as 
$f_{(i)} = \dfrac{Y_i - \hat{Y}_i}{1 - h_{ii}}$, where $Y_i$ is the ith row of Y, $\hat{Y}_i$ is the predicted value for that row, and $h_{ii}$ is the ith 
diagonal element of the “hat” matrix $X(X'X)^{-1}X'$. Leave-one-out PRESS residuals are not 
available when there are more variables than cases, or when X'X is not invertible for any other 
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reason. The Jackknifed PRESS statistic for model selection is the sum of the squared norm 
of the Jackknifed PRESS residuals: 


$$PRESS = \sum_{i=1}^{N} f_{(i)} f_{(i)}'$$
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The purpose of the PLUM procedure is to model the dependence of an ordinal categorical 
response variable on a set of categorical and scale independent variables. 

Since the choice and the number of response categories can be quite arbitrary, it is essential to 
model the dependence such that the choice of the response categories does not affect the conclusion 
of the inference. That is, the final conclusion should be the same if any two or more adjacent 
categories of the old scale are combined. Such considerations lead to modeling the dependence of 
the response on the independent variables by means of the cumulative response probability. 


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


Table 77-1 
Notation 

Notation    Description 
Y           The response variable, which takes integer values from 1 to J. 
J           The number of categories of the ordinal response. 
m           The number of subpopulations. 
$X^A$       $m \times p^A$ matrix with vector-element $x_i^A$, the observed values at the ith 
            subpopulation, determined by the independent variables specified in the 
            command. 
X           $m \times p$ matrix with vector-element $x_i$, the observed values of the location 
            model’s independent variables at the ith subpopulation. 
Z           $m \times q$ matrix with vector-element $z_i$, the observed values of the scale 
            model’s independent variables at the ith subpopulation. 
$f_{ijs}$   The frequency weight for the sth observation which belongs to the cell 
            corresponding to Y=j at subpopulation i. 
$n_{ij}$    The sum of frequency weights of the observations that belong to the cell 
            corresponding to Y=j at subpopulation i. 
$r_{ij}$    The cumulative total up to and including Y=j at subpopulation i. 
$n_i$       The marginal frequency of subpopulation i. 
n           The sum of all frequency weights. 
$\pi_{ij}$  The cell probability corresponding to Y=j at subpopulation i. 
$\gamma_{ij}$   The cumulative response probability up to and including Y=j at 
            subpopulation i. 
$\theta$    (J-1)×1 vector of threshold parameters in the location part of the model. 
$\beta$     p×1 vector of location parameters in the location part of the model. 
$\tau$      q×1 vector of scale parameters in the scale part of the model. 
$B = (\theta^T, \beta^T, \tau^T)^T$   The {(J-1) + p + q}×1 vector of unknown parameters in the general model. 
$\hat{B} = (\hat\theta^T, \hat\beta^T, \hat\tau^T)^T$   The {(J-1) + p + q}×1 vector of maximum likelihood estimates of the 
            parameters in the general model. 
$\tilde{B} = (\tilde\theta^T, \tilde\beta^T)^T$   The {(J-1) + p}×1 vector of maximum likelihood estimates of the 
            parameters in the location-only model. 
$\hat\gamma_{ij}$   The cumulative response probability estimate based on the maximum 
            likelihood estimate $\hat{B}$ in the general model. 
$\tilde\gamma_{ij}$   The cumulative response probability estimate based on the maximum 
            likelihood estimate $\tilde{B}$ in the location-only model. 
$\hat\pi_{ij}$   The cell response probability estimate based on the maximum likelihood 
            estimate $\hat{B}$ in the general model. 
$\tilde\pi_{ij}$   The cell response probability estimate based on the maximum likelihood 
            estimate $\tilde{B}$ in the location-only model. 

é Number of non-redundant parameters in the general model. If all parameters 


are non-redundant,é = (J—1)+ p+ q. 
Number of non-redundant parameters in the location-only model. If all 
parameters are non-redundant, é = (J—1) + p. 


Oc 


Data Aggregation 


Observations with negative or missing frequency weights are discarded. Observations 
are aggregated by the definition of subpopulations. Subpopulations are defined by the 
cross-classifications of either the set of independent variables specified in the command or the set 
of independent variables specified in the subpopulation command. 


Let $n_i$ be the marginal count of subpopulation i, 

$$n_i = \sum_{j=1}^{J} n_{ij}$$

If there is no observation for the cell of Y=j at subpopulation i, it is assumed that $n_{ij} = 0$, provided 
that $n_i \neq 0$. A non-negative scalar $\delta \in [0,1)$ may be added to any zero cell (i.e., cell with $n_{ij} = 0$) 
if its marginal count $n_i$ is nonzero. The value of $\delta$ is zero by default. 


Data Assumptions 


Let $(n_{i1}, \ldots, n_{iJ})^T$ be the J × 1 vector of counts for the categories of Y at subpopulation i. It is 
assumed that each $(n_{i1}, \ldots, n_{iJ})^T$ is independently multinomial distributed with probability vector 
$(\pi_{i1}, \ldots, \pi_{iJ})^T$ of dimension J × 1 and fixed total $n_i$. 


Model 


Let $\gamma_{ij} = \mathrm{Prob}(Y \le j \mid x_i)$ be the cumulative response probability for Y; that is, 

$$\gamma_{ij} = \sum_{k=1}^{j} \pi_{ik}$$

for j = 1, ..., J-1. Note that $\gamma_{iJ} = 1$, hence only the first J-1 $\gamma$’s are needed in the model. 


General Model 


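The general model is given by 

$$\mathrm{link}(\gamma_{ij}) = \frac{\theta_j - \beta^T x_i}{\sigma(z_i)}$$

Possible link functions are 

$$\mathrm{link}(\gamma) = \begin{cases} \log\left(\dfrac{\gamma}{1-\gamma}\right) & \text{Logit link} \\ \log(-\log(1-\gamma)) & \text{Complementary log-log link} \\ -\log(-\log(\gamma)) & \text{Negative Log-log link} \\ \Phi^{-1}(\gamma) & \text{Probit link} \\ \tan(\pi(\gamma - 0.5)) & \text{Cauchit (Inverse Cauchy) link} \end{cases}$$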
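
The five link functions and their inverses can be written compactly as below. The dictionary names and keys (`LINKS`, `INVERSE_LINKS`, `"nloglog"`, and so on) are labels chosen for this sketch, not identifiers from the procedure; the probit link uses the standard library's `NormalDist`.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()   # standard normal, for the probit link

# link(gamma): maps a cumulative probability in (0, 1) to the real line
LINKS = {
    "logit":   lambda g: math.log(g / (1 - g)),
    "cloglog": lambda g: math.log(-math.log(1 - g)),
    "nloglog": lambda g: -math.log(-math.log(g)),
    "probit":  lambda g: _STD_NORMAL.inv_cdf(g),
    "cauchit": lambda g: math.tan(math.pi * (g - 0.5)),
}

# Inverse links: map the linear predictor back to a cumulative probability
INVERSE_LINKS = {
    "logit":   lambda eta: 1 / (1 + math.exp(-eta)),
    "cloglog": lambda eta: 1 - math.exp(-math.exp(eta)),
    "nloglog": lambda eta: math.exp(-math.exp(-eta)),
    "probit":  lambda eta: _STD_NORMAL.cdf(eta),
    "cauchit": lambda eta: 0.5 + math.atan(eta) / math.pi,
}
```

A round trip through a link and its inverse should return the starting probability, which is an easy way to check each pair.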


The numerator in the right hand side of the general model specifies the location of the model, 
$\theta_j - \beta^T x_i$. In the location part of the model, $\theta$ is the vector of thresholds. Values of the thresholds 
are subject to a monotonicity property $\theta_1 \le \ldots \le \theta_{J-1}$. $\beta$ is the vector of location 
parameters. The denominator is the scale part of the model, $\sigma(z)$. Possible forms are: 

$$\sigma(z) = \begin{cases} 1 & \text{if unity scale is assumed} \\ \exp(\tau^T z) & \text{if non-constant scale is assumed} \end{cases}$$

$\tau$ is the vector of scale parameters. 
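
To make the roles of the location and scale parts concrete, the sketch below evaluates the cumulative and cell probabilities for one subpopulation. It is an illustration under stated assumptions: the logit link is used, and the function names and argument layout are hypothetical.

```python
import math

def cumulative_probs(theta, beta, x, tau=None, z=None):
    """Cumulative probabilities gamma_i1..gamma_iJ under the logit link.

    theta: J-1 thresholds; beta, x: location parameters and covariates;
    tau, z: scale parameters and covariates (omit for unity scale).
    """
    sigma = 1.0 if tau is None else math.exp(sum(t * v for t, v in zip(tau, z)))
    location = sum(b * v for b, v in zip(beta, x))
    gammas = [1 / (1 + math.exp(-(th - location) / sigma)) for th in theta]
    return gammas + [1.0]                     # gamma_iJ = 1 by definition

def cell_probs(gammas):
    """Cell probabilities as successive differences of the cumulatives."""
    return [g - prev for g, prev in zip(gammas, [0.0] + gammas[:-1])]
```

With non-decreasing thresholds the cumulative probabilities are non-decreasing, so every cell probability is non-negative and the cells sum to 1.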


Location-Only Model 
If unity scale is assumed, then the general model is said to reduce to the location-only model. The 
parameter B reduces to $B = (\theta^T, \beta^T)^T$. 

Log-likelihood Function 


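The log-likelihood of the model is 

$$l = \sum_{i=1}^{m}\sum_{j=1}^{J-1}\left\{ r_{ij}\,\phi_{ij} - r_{i,j+1}\, g(\phi_{ij}) \right\}$$

where 

$$r_{ij} = \sum_{k=1}^{j} n_{ik}$$

and 

$$\phi_{ij} = \log\left(\frac{\gamma_{ij}}{\gamma_{i,j+1} - \gamma_{ij}}\right)$$

and 

$$g(\phi_{ij}) = \log(1 + \exp(\phi_{ij})) = \log\left(\frac{\gamma_{i,j+1}}{\gamma_{i,j+1} - \gamma_{ij}}\right)$$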
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Note: a constant term ¢ = ©)", log {n;!/ (nji!...ni7!)} which is independent of the unknown 
parameters has been excluded here. Thus, ! is in fact the kernel of the true log-likelihood function. 


Derivatives of the Log-likelihood Function 


The derivatives of the log-likelihood function are used in the iterative parameter estimation 
algorithm. 


First Derivative 


The first derivative of | with respect to B;.,k = 1,...,(/ —l)+p+q,is 


m J-1 


ae = ea es 5 Vii Vijk 


fal jaa OP 


OY; 


Oy Vij 
Qi; =f ijk On, ; oa, - Pijtiky Vij+l oF | 


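where 

$$P_{ijk} = \frac{\partial\eta_{ij}}{\partial B_k} = \begin{cases} \dfrac{\delta_{jk}}{\sigma(z_i)} & \text{if } 1 \le k \le (J-1) \\ \dfrac{-x_{i[k-(J-1)]}}{\sigma(z_i)} & \text{if } (J-1)+1 \le k \le (J-1)+p \\ -z_{i[k-\{(J-1)+p\}]}\,\eta_{ij} & \text{if } (J-1)+p+1 \le k \le (J-1)+p+q \end{cases}$$

with $\eta_{ij} = \mathrm{link}(\gamma_{ij})$, $\delta_{jk} = 1$ if j = k, 0 otherwise, and $P_{iJk} = 0$. For i = 1, ..., m, j = 1, ..., J-1, 

$$\frac{\partial\gamma_{ij}}{\partial\eta_{ij}} = \begin{cases} \gamma_{ij}(1 - \gamma_{ij}) & \text{Logit link} \\ -(1 - \gamma_{ij})\log(1 - \gamma_{ij}) & \text{Complementary log-log link} \\ -\gamma_{ij}\log(\gamma_{ij}) & \text{Negative Log-log link} \\ \phi(\Phi^{-1}(\gamma_{ij})) & \text{Probit link} \\ \cos^2(\pi(\gamma_{ij} - 0.5))/\pi & \text{Cauchit link} \end{cases}$$

and $\partial\gamma_{iJ}/\partial\eta_{iJ} = 0$. 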
Second Derivative 


The second derivative is 


m J—-1 52 ‘ qr 
ot Or l; : Ol; OUj; Ol, .. OQijr 
—— U;;Qijr + - Qi + ——_U;,— 
TB.OB, > ms (seo jQijn 4 Ov;, OB, igh Op 1 OB, 
= 


ae 


fors,k=1,...,(J-1)+ p+q. The first term of the equation is 


et rigtil 
DB.Ooy Oy Vii Qijr = eae Ui jQijs Qijk 


The second term is 


Ol, Oui; / Vij I 
7 ae ik = TAT Clit 5 
Ovi; OB, Qijk rij ij+l Yij+1 Vij Yig+1—7 


To calculate the third term, notice that 


IQ ijk OP jn Oi; 


fae OPijsip Vig Ovig 
OB, OB; Oni; t Pigeon oe > — UB 7: Figs Om 
Qijs IVij Vij oP ij 
Pij+ik Vij+1 Onis 7 Tata ij+1 OBs As Te 
where 
0 1<k<(J—-1)+pandi<s<(J-1 
on =) —2ifs-s-1prpy Pik LS k<(J—1)+pand (J—1)+p+ 
—Zyr-((t—-p4+pyPijs (J -l)+ p+ isks(J-1)+ptq 


and OP; 7;./0B, = 0. Moreover, 
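$$\frac{\partial^2\gamma_{ij}}{\partial B_s\,\partial\eta_{ij}} = R_{ij}\,\frac{\partial\gamma_{ij}}{\partial\eta_{ij}}\,P_{ijs}$$

and $\partial^2\gamma_{iJ}/\partial B_s\,\partial\eta_{iJ} = 0$. $R_{ij}$ has the following form: 

$$R_{ij} = \begin{cases} 1 - 2\gamma_{ij} & \text{Logit link} \\ 1 + \log(1 - \gamma_{ij}) & \text{Complementary log-log link} \\ -(1 + \log\gamma_{ij}) & \text{Negative Log-log link} \\ -\Phi^{-1}(\gamma_{ij}) & \text{Probit link} \\ \sin(2\pi\gamma_{ij}) & \text{Cauchit link} \end{cases}$$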


The third term can be calculated by applying these equations. 


Expectation of the Second Derivative 


Fors, k=1,...,(J-1) +p+q. 


E 071 _ = a Oj Ue 
Ca = Fanon! Dijk 


i=1 j=l 
m J-1 
ij+1 
= S p(-2 ; Ti jQijs Qn) 
i=1 j=l Mj 
m J-1 
= ~ Nj UijQijs Qi jk 
i=l gel 


Parameter Estimation 


Further details of parameter estimation are described here. 
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UijQijs + Gans? (Uij41Qi; +ls 7 UijQijs) )) Quin 


pz 
1l<s en rere 
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Maximum Likelihood Estimate 


To obtain the maximum likelihood estimate of B, a Fisher Scoring iterative estimation method or 
Newton-Raphson iterative estimation method can be used. Let $B^{(t)}$ be the parameter vector at 
iteration t and $\partial l/\partial B^{(t)}$ be a vector of the first derivatives of l evaluated at $B = B^{(t)}$. Moreover, 
let $A^{(t)}$ be a {(J-1)+p+q}×{(J-1)+p+q} matrix such that 

$$A^{(t)}_{sk} = \begin{cases} -\left.\dfrac{\partial^2 l}{\partial B_s\,\partial B_k}\right|_{B=B^{(t)}} & \text{Newton-Raphson approach} \\ -\left.E\left[\dfrac{\partial^2 l}{\partial B_s\,\partial B_k}\right]\right|_{B=B^{(t)}} & \text{Fisher Scoring approach} \end{cases}$$

For a location-only model, the corresponding formulas use the first (J-1)+p elements of 
$\partial l/\partial B^{(t)}$ and the upper {(J-1)+p}×{(J-1)+p} submatrix of $A^{(t)}$. 


The parameter vector B at iteration t + 1 is updated by $B^{(t+1)}$ where 

$$A^{(t)}B^{(t+1)} = A^{(t)}B^{(t)} + \xi\,\frac{\partial l}{\partial B^{(t)}}$$

and $\xi > 0$ is a stepping scalar such that $l(B^{(t+1)}) - l(B^{(t)}) > 0$. 


Stepping 


Use the step-halving method if $l(B^{(t+1)}) - l(B^{(t)}) \le 0$. Let V be the maximum number of steps 
in step-halving; then the set of values of $\xi$ is $\{1/2^v : v = 0, \ldots, V-1\}$. 
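
Step-halving can be sketched as a short search over the candidate step sizes. The function below is illustrative only: its name, the callable `loglik`, and the precomputed `direction` (standing for the solution of $A^{(t)}d = \partial l/\partial B^{(t)}$) are assumptions of this sketch.

```python
def step_halving(loglik, b, direction, max_steps=5):
    """Try xi = 1, 1/2, 1/4, ... until the log-likelihood increases.

    Returns (new_b, xi), or (None, None) if no step among the
    max_steps candidates improves the log-likelihood.
    """
    base = loglik(b)
    for v in range(max_steps):
        xi = 1.0 / 2 ** v
        candidate = [bi + xi * di for bi, di in zip(b, direction)]
        if loglik(candidate) - base > 0:
            return candidate, xi
    return None, None
```

For instance, with the toy objective -(b - 1)^2, a start at 0, and a direction of 4, the full step and the half step do not improve the objective, while the quarter step lands exactly on the maximum.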


Starting Values of the Parameters 


Location-Only Model 


If a location-only model is specified, set $B^{(0)} = (\theta^{(0)T}, 0^T)^T$ where 


General Model 


If a general model is specified, first ignore the scale part; that is, by assuming that $\tau = 0$ and 
treating the model as if it is a location-only model, use $B^{(0)} = (\theta^{(0)T}, 0^T)^T$ as the starting 
value to obtain the maximum likelihood estimate $\tilde{B}$. After $\tilde{B}$ is obtained, find the maximum 
likelihood estimate $\hat{B}$ of the general model by starting at $(\tilde\theta^T, \tilde\beta^T, 0^T)^T$. 
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ee : ; wT aT -T\E ; 
The above practice is essentially the same as taking B‘°) = (a ,0-,0 ) . The advantage is 
that the maximum likelihood estimate B can be obtained in the process of finding B. 


Ordinal Adjustments for the Threshold Parameters 


If the monotonicity property $\theta_1 \le \ldots \le \theta_{J-1}$ is not preserved at the end of any iteration, an ad 
hoc adjustment will be taken before the next iteration starts. If $\theta_j^{(t)} > \theta_{j+1}^{(t)}$ for some j, then both 
$\theta_j^{(t)}$ and $\theta_{j+1}^{(t)}$ are set to $(\theta_j^{(t)} + \theta_{j+1}^{(t)})/2$ before the next iteration. This value is then compared 
with $\theta_{j+2}^{(t)}$, and so on. 
Convergence Criteria 


Given convergence criteria $\varepsilon_l > 0$ and $\varepsilon_p > 0$, the iteration is considered to be converged if one of 
the following criteria is satisfied: 

$$\left|l(B^{(t+1)}) - l(B^{(t)})\right| < \varepsilon_l$$

$$\max_i \left|B_i^{(t+1)} - B_i^{(t)}\right| < \varepsilon_p$$


Statistics 


The following statistics are available. 


Model Information 


The model information is the —2 log-likelihood of the model, computed for a given vector of 
parameter estimates. 


Final Model, General 
The value of -2 log-likelihood of the model is given by 

$$-2\,l(\hat{B})$$

where $l(\hat{B})$ is the value of the log-likelihood evaluated at $\hat{B}$. 


Final Model, Location-Only 


If unity scale is assumed, the general model reduces to the location-only model. The value of —2 
log-likelihood of the model is given by 


$$-2\,l(\tilde{B})$$
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Initial Model, Intercept-Only 
In the initial model, when the intercepts are the only parameters in the model, the parameter vector 
is $B^{(0)} = (\theta^{(0)T}, 0^T, 0^T)^T$. The value of the -2 log-likelihood is 

$$-2\,l(B^{(0)})$$


Model Chi-Square 


The value of the Model Chi-square statistic is given by the difference between the -2 log-likelihood 
values of any two nested models of interest. 


General Model versus Intercept-Only Model 


The following statistic is available when a general model is specified. The Model Chi-square 
Statistic is given by 


$$-2\,l(B^{(0)}) - \{-2\,l(\hat{B})\}$$

Under the null hypothesis $H_0: \beta = 0$ and $\tau = 0$, the Model Chi-square is asymptotically 
chi-squared distributed with degrees of freedom equal to the number of non-redundant parameters 
in the general model minus (J-1). 


Location-Only Model versus Intercept-Only Model 


The following statistic is available when a location-only model is specified. The Model Chi-square 
Statistic is given by 


$$-2\,l(B^{(0)}) - \{-2\,l(\tilde{B})\}$$

Under the null hypothesis $H_0: \beta = 0$, the Model Chi-square is asymptotically chi-squared 
distributed with degrees of freedom equal to the number of non-redundant parameters in the 
location-only model minus (J-1). 


General Model versus Location-Only Model 


The following statistic is available when a general model is specified. The Model Chi-square 
Statistic is given by 


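$$-2\,l(\tilde{B}) - \{-2\,l(\hat{B})\}$$

Under the null hypothesis $H_0: \tau = 0$, the Model Chi-square is asymptotically chi-squared 
distributed with degrees of freedom equal to the number of non-redundant parameters in the 
general model minus the number in the location-only model. 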


Likelihood Ratio Test for Equal Slopes Assumption 


For location-only model, a likelihood ratio test of parallel lines in the location is performed. If the 
regression lines are not parallel, the location can be specified as 
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nig = 9; — bya; 
for j = 1, ..., J-1. That is, the location parameters $\beta$ (or slopes) vary with the levels 
of the response. The parameter for the above “non-parallel” location-only model is 
$B = (\theta^T, \beta_1^T, \ldots, \beta_{J-1}^T)^T$, which is of dimension {(J-1)+(J-1)p}×1. The first derivative $\partial l/\partial B$ of 
the log-likelihood is the same as in the “parallel” model, except that $P_{ijk} = \partial\eta_{ij}/\partial B_k$ is 
replaced by the following: 
| Le 1<k<(J-1) 

ijk = OB, — —Xik—{(J-1)+ep}] (J -1l) +sp<k< (J—-1)+spt+p,s=1,...,(J —2) 


Similarly, the expected value of the second derivative is the same as in the parallel model, except 
that $P_{ijk}$ is replaced by the above equation. 


To test the null hypothesis of parallelism $H_0: \beta_1 = \ldots = \beta_{J-1}$, find the maximum likelihood 
estimate $\tilde{B}$ of the parallel location-only model and the maximum likelihood estimate of the 
non-parallel model, denoted here $\tilde{B}_{np}$. The Model Chi-square statistic is given by 

$$-2\,l(\tilde{B}) - \{-2\,l(\tilde{B}_{np})\}$$

Under the null hypothesis, the Model Chi-square statistic is asymptotically chi-squared distributed 
with (J-2)p degrees of freedom, the difference between the (J-1)+(J-1)p parameters of the 
non-parallel model and the (J-1)+p parameters of the parallel model. 


Pseudo R-Squares 


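Replace $\hat{B}$ by $\tilde{B}$ for a location-only model in the equations below. 


Cox and Snell’s R-Square 

$$R^2_{CS} = 1 - \left(\frac{L(B^{(0)})}{L(\hat{B})}\right)^{2/n}$$


Nagelkerke’s R-Square 

$$R^2_N = \frac{R^2_{CS}}{1 - L(B^{(0)})^{2/n}}$$

McFadden’s R-Square 

$$R^2_M = 1 - \left(\frac{l(\hat{B})}{l(B^{(0)})}\right)$$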
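
Because the likelihood ratios can underflow, the three measures are easier to compute from log-likelihoods. The sketch below assumes `l0` and `l1` are the log-likelihoods of the intercept-only and fitted models and `n` is the sum of frequency weights; the function name is hypothetical.

```python
import math

def pseudo_r_squares(l0, l1, n):
    """Return (Cox-Snell, Nagelkerke, McFadden) from log-likelihoods.

    Uses (L0 / L1) ** (2 / n) = exp(2 * (l0 - l1) / n) so the
    likelihoods themselves never have to be formed.
    """
    cox_snell = 1 - math.exp(2 * (l0 - l1) / n)
    nagelkerke = cox_snell / (1 - math.exp(2 * l0 / n))
    mcfadden = 1 - l1 / l0
    return cox_snell, nagelkerke, mcfadden
```

When the fitted model does no better than the intercept-only model (l1 equal to l0), all three measures are zero.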


Predicted Cell Counts 


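The estimated cell response probability based on the maximum likelihood estimate for the 
general model is 

$$\hat\pi_{ij} = \hat\gamma_{ij} - \hat\gamma_{i,j-1}$$

with $\hat\gamma_{i0} = 0$. At each subpopulation i, the predicted count for response category Y=j is 
$\hat{n}_{ij} = n_i\hat\pi_{ij}$. 
The (raw) residual is $n_{ij} - \hat{n}_{ij}$ and the standardized residual is $(n_{ij} - \hat{n}_{ij})/\sqrt{n_i\hat\pi_{ij}(1 - \hat\pi_{ij})}$. 


Replace $\hat\gamma_{ij}$ by $\tilde\gamma_{ij}$, $\hat\pi_{ij}$ by $\tilde\pi_{ij}$, and $\hat{n}_{ij}$ by $\tilde{n}_{ij}$ for a location-only model. 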


Predicted Cumulative Totals 
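The predicted cumulative total up to and including Y=j is 
$\hat{r}_{ij} = n_i\hat\gamma_{ij}$. 
The (raw) residual is $r_{ij} - \hat{r}_{ij}$ and the standardized residual is $(r_{ij} - \hat{r}_{ij})/\sqrt{n_i\hat\gamma_{ij}(1 - \hat\gamma_{ij})}$. 


Replace $\hat\gamma_{ij}$ by $\tilde\gamma_{ij}$ and $\hat{r}_{ij}$ by $\tilde{r}_{ij}$ for a location-only model. 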


Goodness of Fit Measures 


These are chi-square statistics used to test whether the model adequately fits the data. 


Pearson Goodness of Fit Measure 
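The Pearson goodness of fit measure for a general model is 

$$X^2 = \sum_{i=1}^{m}\sum_{j=1}^{J}\frac{(n_{ij} - n_i\hat\pi_{ij})^2}{n_i\hat\pi_{ij}}$$

Under the null hypothesis, the Pearson goodness-of-fit statistic is asymptotically chi-squared 
distributed with m(J-1) minus the number of non-redundant parameters in the general model 
degrees of freedom. 


Replace $\hat\pi_{ij}$ by $\tilde\pi_{ij}$, and use the number of non-redundant parameters in the location-only 
model, for a location-only model. 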


Deviance Goodness of Fit Measure 


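The Deviance goodness of fit measure for a general model is 

$$D = 2\sum_{i=1}^{m}\sum_{j=1}^{J} n_{ij}\log\left(\frac{n_{ij}}{n_i\hat\pi_{ij}}\right)$$

Under the null hypothesis, the Deviance goodness-of-fit statistic is asymptotically chi-squared 
distributed with m(J-1) minus the number of non-redundant parameters in the general model 
degrees of freedom. 


Replace $\hat\pi_{ij}$ by $\tilde\pi_{ij}$, and use the number of non-redundant parameters in the location-only 
model, for a location-only model. 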
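
Both goodness-of-fit measures compare observed cell counts with the fitted counts $n_i\hat\pi_{ij}$, and can be sketched together. The function name and the nested-list layout (one row per subpopulation) are assumptions of this sketch, as is the convention that a cell with zero observed count contributes nothing to the deviance.

```python
import math

def goodness_of_fit(counts, probs):
    """Return (Pearson, Deviance) from counts[i][j] and fitted probs[i][j]."""
    pearson = 0.0
    deviance = 0.0
    for n_row, p_row in zip(counts, probs):
        n_i = sum(n_row)                          # marginal count n_i
        for n_ij, p_ij in zip(n_row, p_row):
            expected = n_i * p_ij
            pearson += (n_ij - expected) ** 2 / expected
            if n_ij > 0:                          # treat 0 * log(0) as 0
                deviance += 2 * n_ij * math.log(n_ij / expected)
    return pearson, deviance
```

When the fitted probabilities reproduce the observed proportions exactly, both statistics are zero.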
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Covariance and Correlation Matrices 


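The estimate of the covariance matrix of $\hat{B}$ is 

$$\mathrm{Cov}(\hat{B}) = \begin{cases} \left[-\left.\dfrac{\partial^2 l}{\partial B\,\partial B^T}\right|_{B=\hat{B}}\right]^{-1} & \text{Newton-Raphson method} \\ \left[-\left.E\left(\dfrac{\partial^2 l}{\partial B\,\partial B^T}\right)\right|_{B=\hat{B}}\right]^{-1} & \text{Fisher Scoring method} \end{cases}$$

Let $\hat\sigma$ be the {(J-1)+p+q}×1 vector of the square roots of the diagonal elements in $\mathrm{Cov}(\hat{B})$. The 
estimate of the correlation matrix of $\hat{B}$ is 

$$\mathrm{Corr}(\hat{B}) = \mathrm{diag}(\hat\sigma)^{-1}\,\mathrm{Cov}(\hat{B})\,\mathrm{diag}(\hat\sigma)^{-1}$$

Replace $\hat{B}$ by $\tilde{B}$ and $\hat\sigma$ by $\tilde\sigma$ (a {(J-1)+p}×1 vector) for a location-only model. 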


Parameter Statistics 
An estimate of the standard deviation of $\hat{B}_k$ is $\hat\sigma_k$. The Wald statistic for $\hat{B}_k$ is 

$$\mathrm{Wald}_k = \left(\frac{\hat{B}_k}{\hat\sigma_k}\right)^2$$

Under the null hypothesis $H_0: B_k = 0$, $\mathrm{Wald}_k$ is asymptotically chi-squared distributed 
with 1 degree of freedom. 


Based on the asymptotic normality of the parameter estimate, a 100(1-α)% Wald confidence 
interval for $B_k$ is 

$$\hat{B}_k \pm z_{1-\alpha/2}\,\hat\sigma_k$$

where $z_{1-\alpha/2}$ is the upper (1-α/2)100th percentile of the standard normal distribution. 


Replace $\hat{B}_k$ by $\tilde{B}_k$ and $\hat\sigma_k$ by $\tilde\sigma_k$ for a location-only model. 
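
The Wald statistic and confidence interval for a single parameter can be sketched as below, using the standard library's normal quantile function. The function name `wald_summary` and its return layout are assumptions made for this illustration.

```python
from statistics import NormalDist

def wald_summary(estimate, std_error, alpha=0.05):
    """Return the Wald statistic and the 100(1 - alpha)% Wald interval."""
    wald = (estimate / std_error) ** 2
    z = NormalDist().inv_cdf(1 - alpha / 2)      # z_{1 - alpha/2}
    return wald, (estimate - z * std_error, estimate + z * std_error)
```

For an estimate of 2.0 with standard error 0.5, the Wald statistic is 16 and the 95% interval is roughly 2.0 plus or minus 0.98.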


Linear Hypothesis Testing 
For a general model, let L be a matrix of coefficients for the linear hypotheses

\[ H_0: LB = c \]

where c is a k×1 vector of constants. The Wald statistic for $H_0$ is

\[ \mathrm{Wald}(L, c) = (L\hat{B} - c)^T \{ L \, \mathrm{Cov}(\hat{B}) \, L^T \}^{-1} (L\hat{B} - c) \]


Under the null hypothesis, Wald(L, c) is asymptotically chi-squared distributed with $l$ degrees of
freedom, where $l$ is the rank of L.


Replace B by B for a location-only model. 
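

The single-parameter Wald statistic and confidence interval described under Parameter Statistics can be sketched as follows. This is an illustrative sketch only: the function name wald_test and its arguments are hypothetical, and the p-value uses the identity that a chi-squared variable with 1 degree of freedom is the square of a standard normal variable.

```python
from math import erfc, sqrt
from statistics import NormalDist


def wald_test(estimate, std_error, alpha=0.05):
    """Return (Wald statistic, p-value, confidence interval) for one parameter."""
    # Wald_k = (B_k / sigma_k)^2
    wald = (estimate / std_error) ** 2
    # Upper tail of chi-squared(1): P(Z^2 > w) = erfc(sqrt(w / 2))
    p_value = erfc(sqrt(wald / 2.0))
    # z_{1 - alpha/2}: percentile of the standard normal distribution
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    interval = (estimate - z * std_error, estimate + z * std_error)
    return wald, p_value, interval


wald, p_value, (lower, upper) = wald_test(1.0, 0.5)
print(wald, p_value, lower, upper)
```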
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PPLOT Algorithms 


PPLOT produces probability plots of one or more sequence or time series variables. The variables 
can be standardized, differenced, and/or transformed before plotting. Expected normal values or 
deviations from expected normal values can be plotted. PPLOT can be used to investigate whether 
the data are from a specified distribution: normal, lognormal, logistic, exponential, Weibull, 
gamma, beta, uniform, Pareto, Laplace, half normal, chi-square and Student’s t. 


Notation 


The following notation is used throughout this section unless otherwise stated: 


Table 78-1

Notation
Notation              Description
$\bar{X}$             Sample mean
$S$                   Sample standard deviation
$LX$                  Sample mean for $\ln(x_i)$
$LS$                  Sample standard deviation for $\ln(x_i)$
$x_i$                 Value of the ith observation
$x_{(i)}$             The ith smallest observation
$R_i$                 Corresponding rank for $x_i$
$n$                   Sample size
$fr_{dist}(x_i)$      Fractional rank of $x_i$ for the specified distribution function
$Q_{dist}(x_i)$       Score for the specified distribution function
$\alpha$              Location parameter
$\beta$               Scale parameter
$\gamma$              Shape parameter
$\nu$                 Degrees of freedom

Fractional Ranks 
Based on the rank $R_i$ for the observation $x_i$, the fractional rank $fr_{dist}(x_i)$ is computed and
used to estimate the expected cumulative distribution function of X. One of four methods can be
selected to calculate the fractional rank $fr_{dist}(x_i)$:


\[ fr_{dist}(x_i) = \begin{cases} (R_i - 3/8) / (n + 1/4) & \text{Blom} \\ (R_i - 1/2) / n & \text{Rankit} \\ (R_i - 1/3) / (n + 1/3) & \text{Tukey} \\ R_i / (n + 1) & \text{Van der Waerden} \end{cases} \]
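

A minimal sketch of the four fractional rank formulas, assuming the rank has already been computed. The function name fractional_rank and the lowercase method labels are hypothetical, chosen only for illustration.

```python
def fractional_rank(rank, n, method="blom"):
    """Fractional rank for one observation with the given rank among n cases."""
    if method == "blom":
        return (rank - 3.0 / 8.0) / (n + 1.0 / 4.0)
    if method == "rankit":
        return (rank - 1.0 / 2.0) / n
    if method == "tukey":
        return (rank - 1.0 / 3.0) / (n + 1.0 / 3.0)
    if method == "vanderwaerden":
        return rank / (n + 1.0)
    raise ValueError("unknown method: " + method)


for name in ("blom", "rankit", "tukey", "vanderwaerden"):
    print(name, fractional_rank(1, 10, name))
```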
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Scores 


The score of the specified distribution for case i is defined as

\[ Q_{dist}(x_i) = F^{-1}_{dist}(fr_{dist}(x_i)), \quad i = 1, \ldots, n \]

where $F^{-1}_{dist}$ is the inverse cumulative specified distribution function.


P-P Plot 


For a P-P plot, the fractional rank and the cumulative specified distribution function $F_{dist}$ are
plotted:


\[ (fr_{dist}(x_i), \; F_{dist}(x_i)), \quad i = 1, \ldots, n \]


Q-Q Plot 
For a Q-Q plot, the observations and the score for the specified distribution function are plotted:


\[ (x_i, \; Q_{dist}(x_i)), \quad i = 1, \ldots, n \]
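

The Q-Q construction can be illustrated for the normal distribution with estimated parameters. This is a sketch under stated assumptions: it uses Blom fractional ranks, estimates location and scale by the sample mean and sample standard deviation, and gives tied values consecutive ranks, which may differ from how a real procedure treats ties. The function name normal_qq_points is hypothetical.

```python
from statistics import NormalDist, mean, stdev


def normal_qq_points(values):
    """Return (x_i, Q_dist(x_i)) pairs for a normal Q-Q plot."""
    n = len(values)
    # Location and scale estimated from the data (sample mean and SD)
    dist = NormalDist(mean(values), stdev(values))
    # Rank the observations; ties receive consecutive ranks in this sketch
    order = sorted(range(n), key=lambda i: values[i])
    points = [None] * n
    for rank0, i in enumerate(order):
        fr = (rank0 + 1 - 3.0 / 8.0) / (n + 1.0 / 4.0)  # Blom fractional rank
        points[i] = (values[i], dist.inv_cdf(fr))        # score = F^{-1}(fr)
    return points


for x, q in normal_qq_points([2.0, 4.0, 6.0]):
    print(x, q)
```

Points that fall close to the line y = x suggest that the data are compatible with the specified distribution.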


Distributions 


The distributions and their parameters are listed below. Parameters may be either specified by 
users or estimated from the data. Any parameter values specified by the user should satisfy the 
conditions indicated. 


Table 78-2
Distributions
Distribution                   Description
Beta($\beta_1, \beta_2$)       $\beta_1 (> 0)$ and $\beta_2 (> 0)$ are scale parameters.
Chi-square($\nu$)              $\nu (> 0)$ is the degrees of freedom.
Exponential($\beta$)           $\beta (> 0)$ is a scale parameter.
Gamma($\gamma, \beta$)         $\gamma (> 0)$ is a shape parameter and $\beta (> 0)$ is the scale parameter.
Half Normal($\beta$)           $\beta (> 0)$ is a scale parameter and the location parameter is 0.
Laplace($\alpha, \beta$)       $\alpha$ is the location parameter and $\beta (> 0)$ is the scale parameter.
Logistic($\alpha, \beta$)      $\alpha$ is the location parameter and $\beta (> 0)$ is the scale parameter.
Lognormal($\beta, \gamma$)     $\beta (> 0)$ is a scale parameter and $\gamma (> 0)$ is a shape parameter.
Normal($\alpha, \beta$)        $\alpha$ is the location parameter and $\beta (> 0)$ is the scale parameter.
Pareto($\beta, b$)             $\beta (> 0)$ is a scale parameter and $b (> 0)$ is an index of inequality.
Student's t($\nu$)             $\nu (> 0)$ is the degrees of freedom specified by the user.
Uniform($a, b$)                $a$ is a lower bound and $b$ is an upper bound.
Weibull($\beta, \gamma$)       $\beta (> 0)$ is a scale parameter and $\gamma (> 0)$ is a shape parameter.
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Estimates of the Parameters 


The estimates for parameters for each distribution are defined below. 


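Table 78-3
Parameter estimates for distributions
Distribution                   Description                                                                          Parameter type
Beta($\beta_1, \beta_2$)       $\hat{\beta}_1 = \bar{X} \{ \bar{X}(1-\bar{X})/S^2 - 1 \}$                            scale parameter
                               $\hat{\beta}_2 = (1-\bar{X}) \{ \bar{X}(1-\bar{X})/S^2 - 1 \}$                        scale parameter
Chi-square($\nu$)              $\nu$ is the degrees of freedom specified by the user.
Exponential($\beta$)           $\hat{\beta} = 1/\bar{X}$                                                             scale parameter
Gamma($\gamma, \beta$)         $\hat{\gamma} = \bar{X}^2 / S^2$                                                      shape parameter
                               $\hat{\beta} = \bar{X} / S^2$                                                         scale parameter
Half Normal($\beta$)           $\hat{\beta} = \sqrt{(x_1^2 + \cdots + x_n^2)/n}$                                     scale parameter
Laplace($\alpha, \beta$)       $\hat{\alpha} = \bar{X}$                                                              location parameter
                               $\hat{\beta} = \sqrt{S^2/2}$                                                          scale parameter
Logistic($\alpha, \beta$)      $\hat{\alpha} = \bar{X}$                                                              location parameter
                               $\hat{\beta} = \sqrt{3}\,(S/\pi)$, $\pi = 3.1415927$                                  scale parameter
Lognormal($\beta, \gamma$)     $\hat{\beta} = \exp(LX)$                                                              scale parameter
                               $\hat{\gamma} = LS$                                                                   shape parameter
Normal($\alpha, \beta$)        $\hat{\alpha} = \bar{X}$                                                              location parameter
                               $\hat{\beta} = S$                                                                     scale parameter
Pareto($\beta, b$)             $\hat{\beta} = \min\{x_1, \ldots, x_n\}$                                              scale parameter
                               $\hat{b} = 1 / (LX - \ln(\hat{\beta}))$                                               index of inequality
Student's t($\nu$)             $\nu$ is the degrees of freedom specified by the user.
Uniform($a, b$)                $\hat{a} = \min\{x_1, \ldots, x_n\}$                                                  lower bound
                               $\hat{b} = \max\{x_1, \ldots, x_n\}$                                                  upper bound
Weibull($\beta, \gamma$)       $\hat{\beta} = \dfrac{\sum_{i=1}^{n} Y_i U_i - n\bar{Y}\bar{U}}{\sum_{i=1}^{n} U_i^2 - n\bar{U}^2}$    scale parameter
                               $\hat{\gamma} = \exp( -(\bar{Y} - \hat{\beta}\bar{U}) / \hat{\beta} )$                shape parameter
where
$Y_i = \ln(-\ln(1 - fr_{dist}(x_i)))$ and $U_i = \ln(x_i)$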
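

The Weibull estimates can be read as a least-squares line fitted to the points $(U_i, Y_i)$. The sketch below assumes exactly that reading, uses Blom fractional ranks, and gives tied values consecutive ranks; the function name weibull_estimates is hypothetical.

```python
from math import exp, log


def weibull_estimates(values):
    """Estimate (beta, gamma) by regressing Y = ln(-ln(1 - fr)) on U = ln(x)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    fr = [0.0] * n
    for rank0, i in enumerate(order):
        fr[i] = (rank0 + 1 - 3.0 / 8.0) / (n + 1.0 / 4.0)  # Blom fractional rank
    u = [log(x) for x in values]              # U_i = ln(x_i), requires x_i > 0
    y = [log(-log(1.0 - f)) for f in fr]      # Y_i = ln(-ln(1 - fr_i))
    u_bar = sum(u) / n
    y_bar = sum(y) / n
    s_uy = sum(ui * yi for ui, yi in zip(u, y)) - n * u_bar * y_bar
    s_uu = sum(ui * ui for ui in u) - n * u_bar * u_bar
    beta = s_uy / s_uu                                   # slope of the fitted line
    gamma = exp(-(y_bar - beta * u_bar) / beta)          # from the intercept
    return beta, gamma


print(weibull_estimates([0.8, 1.9, 2.4, 3.1, 4.7]))
```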
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PRINCALS Algorithms 


The PRINCALS algorithm was first described in Van Rijckevorsel and De Leeuw (1979) and 
De Leeuw and Van Rijckevorsel (1980); also see Gifi (1981, 1985). Characteristic features of 
PRINCALS are the ability to specify any of a number of measurement levels for each variable 
separately and the treatment of missing values by setting weights in the loss function equal to 0. 


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


n               Number of cases (objects)
m               Number of variables
p               Number of dimensions

For variable j, j = 1, ..., m:

$h_j$           n-vector with categorical observations
$k_j$           Number of valid categories (distinct values) of variable j
$G_j$           Indicator matrix for variable j, of order $n \times k_j$, with elements
                $g_{(j)ir} = 1$ when the ith object is in the rth category of variable j
                $g_{(j)ir} = 0$ when the ith object is not in the rth category of variable j
$D_j$           Diagonal matrix, containing the univariate marginals; that is, the column
                sums of $G_j$
$M_j$           Binary diagonal n×n matrix, with diagonal elements defined as
                $m_{(j)ii} = 1$ when the ith observation is within the range $[1, k_j]$
                $m_{(j)ii} = 0$ when the ith observation is outside the range $[1, k_j]$


The quantification matrices and parameter vectors are: 


$X$             Object scores, of order n×p

$Y_j$           Multiple category quantifications, of order $k_j \times p$

$y_j$           Single category quantifications, of order $k_j$

$a_j$           Variable weights (equal to component loadings), of order p

$Q$             Transformed data matrix of order n×m with columns $q_j = G_j y_j$
$Y$             Collection of multiple and single category quantifications.


Note: The matrices $G_j$, $M_j$, and $D_j$ are exclusively notational devices; they are stored in reduced
form, and the program fully profits from their sparseness by replacing matrix multiplications
with selective accumulation.
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Objective Function Optimization 


The PRINCALS objective is to find object scores X and a set of $\underline{Y}_j$ (for j = 1, ..., m), where the
underlining indicates that they may be restricted in various ways, so that the function


\[ \sigma(X; \underline{Y}) = \frac{1}{m} \sum_{j} \mathrm{tr}\left( (X - G_j \underline{Y}_j)' M_j (X - G_j \underline{Y}_j) \right) \]


is minimal, under the normalization restriction $X' M_* X = mnI$, where $M_* = \sum_j M_j$ and I is the
p×p identity matrix. The inclusion of $M_j$ in $\sigma(X; \underline{Y})$ ensures that there is no influence of data
values outside the range $[1, k_j]$, a circumstance that may indicate either genuine missing values or
simulated missing values for the sake of analysis. $M_*$ contains the number of “active” data values
for each object. The object scores are also centered; that is, they satisfy $u' M_* X = 0$, with
u denoting an n-vector of ones.


Optimal Scaling Levels 


The following optimal scaling levels are distinguished in PRINCALS: 


Multiple Nominal. $\underline{Y}_j = Y_j$ (equality restriction only).
(Single) Nominal. $\underline{Y}_j = y_j a_j'$ (equality and rank-one restrictions).


(Single) Ordinal. $\underline{Y}_j = y_j a_j'$ and $y_j \in C_j$ (equality, rank-one, and monotonicity restrictions).
The monotonicity restriction $y_j \in C_j$ means that $y_j$ must be located in the convex cone of all
$k_j$-vectors with nondecreasing elements.


(Single) Numerical. $\underline{Y}_j = y_j a_j'$ and $y_j \in L_j$ (equality, rank-one, and linearity restrictions). The
linearity restriction $y_j \in L_j$ means that $y_j$ must be located in the subspace of all $k_j$-vectors that
are a linear transformation of the vector consisting of $k_j$ successive integers.


For each variable, these levels can be chosen independently. The general requirement for all
options is that equal category indicators receive equal quantifications. The general requirement
for the non-multiple options is $\underline{Y}_j = y_j a_j'$; that is, $\underline{Y}_j$ is of rank one; for identification purposes,
$y_j$ is always normalized so that $y_j' D_j y_j = n$.


Optimization 
Optimization is achieved by executing the following iteration scheme: 
1. Initialization I or II 
2. Update object scores 
3. Orthonormalization 
4. Update category quantifications 


5. Convergence test: repeat (2) through (4) or continue 
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6. Rotation 


Steps (1) through (6) are explained below. 


Initialization 

I. Random 

The object scores X are initialized with random numbers. Then X is normalized so that
$u' M_* X = 0$ and $X' M_* X = mnI$, yielding $\tilde{X}$. For multiple variables, the initial category
quantifications are obtained as $\tilde{Y}_j = D_j^{-1} G_j' \tilde{X}$. For single variables, the initial category
quantifications $\tilde{y}_j$ are defined as the first $k_j$ successive integers normalized in such a way that
$u' D_j \tilde{y}_j = 0$ and $\tilde{y}_j' D_j \tilde{y}_j = n$, and the initial variable weights are calculated as the vector
$\tilde{a}_j = \tilde{X}' G_j \tilde{y}_j$, rescaled to unit length.


II. All relevant quantities are copied from the results of the first cycle. 


Update object scores 
First the auxiliary score matrix Z is computed as

\[ Z \leftarrow \sum_j M_j G_j \tilde{Y}_j \]


and centered with respect to $M_*$:

\[ \tilde{Z} \leftarrow \{ M_* - (M_* u u' M_* / u' M_* u) \} Z \]

These two steps would yield the locally best updates if there were no orthogonality constraints.


Orthonormalization 


The problem is to find an $M_*$-orthonormal $X^+$ that is closest to $\tilde{Z}$ in the least squares sense. In
PRINCALS, this is done by setting


\[ X^+ \leftarrow m^{1/2} M_*^{-1/2} \, \mathrm{GRAM}(M_*^{-1/2} \tilde{Z}) \]


which is equal to the genuine least squares estimate up to a rotation (see (6)). The notation
GRAM( ) is used to denote the Gram-Schmidt transformation (Björck and Golub, 1973).


Update category quantifications 
For multiple nominal variables, the new category quantifications are computed as:

\[ Y_j^+ = D_j^{-1} G_j' X^+ \]


For single variables one cycle of an ALS algorithm (De Leeuw et al., 1976) is executed for
computing the rank-one decomposition of $Y_j^+$, with restrictions on the left-hand vector. This cycle
starts from the previous category quantification $\tilde{y}_j$ with

\[ a_j^+ = Y_j^{+\prime} D_j \tilde{y}_j \]

When the current variable is numerical, we are ready; otherwise we compute

\[ y_j^* = Y_j^+ a_j^+ \]


Now, when the current variable is single nominal, you can simply obtain $y_j^+$ by normalizing
$y_j^*$ in the way indicated below; otherwise the variable must be ordinal, and you have to insert
the weighted monotonic regression process


\[ y_j^* \leftarrow \mathrm{WMON}(y_j^*) \]

The notation WMON( ) is used to denote the weighted monotonic regression process, which
makes $y_j^*$ monotonically increasing. The weights used are the diagonal elements of $D_j$, and the
subalgorithm used is the up-and-down-blocks minimum violators algorithm (Kruskal, 1964;
Barlow et al., 1972). The result is normalized:


\[ y_j^+ = n^{1/2} \, y_j^* \left( y_j^{*\prime} D_j y_j^* \right)^{-1/2} \]


Finally, we set $\underline{Y}_j^+ = y_j^+ a_j^{+\prime}$.
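

The weighted monotonic regression step can be illustrated with a short sketch. Note the substitution: the code below uses the pool-adjacent-violators scheme rather than the up-and-down-blocks minimum violators algorithm named above; both target the same weighted least-squares nondecreasing fit. The function name and the requirement of strictly positive weights are assumptions of this sketch.

```python
def weighted_monotone_regression(values, weights):
    """Weighted least-squares nondecreasing fit (pool-adjacent-violators sketch)."""
    blocks = []  # each block holds [weighted mean, total weight, number of items]
    for value, weight in zip(values, weights):
        blocks.append([value, weight, 1])
        # Merge backwards while the last two blocks violate monotonicity
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, c1 + c2])
    fitted = []
    for block_mean, _, count in blocks:
        fitted.extend([block_mean] * count)
    return fitted


print(weighted_monotone_regression([1.0, 3.0, 2.0, 4.0], [1.0, 1.0, 1.0, 1.0]))
```

In the PRINCALS setting the weights would be the category frequencies on the diagonal of $D_j$, so categories with zero frequency would need separate handling.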


Convergence test 


The difference between consecutive values of the quantity


\[ TFIT = \frac{1}{m} \left\{ \sum_{s} \sum_{j \in J} y_{(j)s}' D_j y_{(j)s} + \sum_{j \notin J} a_j' a_j \right\} \]


where $y_{(j)s}$ denotes the sth column of $Y_j$ and J is an index set recording which variables are
multiple, is compared with the user-specified convergence criterion $\varepsilon$, a small positive number.
It can be shown that $TFIT = p - \sigma(X; \underline{Y})$. Steps (2) through (4) are repeated as long as the
loss difference exceeds $\varepsilon$.


Rotation 


As remarked in (3), during iteration the orientation of X and Y with respect to the coordinate
system is not necessarily correct; this also reflects that $\sigma(X; \underline{Y})$ is invariant under simultaneous
rotations of X and Y. From the theory of principal components, it is known that if all variables
were single, the matrix A, which can be formed by stacking the row vectors $a_j'$, would have the
property that A'A is diagonal. Therefore you can rotate so that the matrix


\[ \frac{1}{m} A'A = \frac{1}{m} \sum_j a_j a_j' = \frac{1}{m} \sum_j Y_j' D_j Y_j \]


becomes diagonal. The corresponding eigenvalues are printed after the convergence message
of the program. The calculation involves tridiagonalization with Householder transformations
followed by the implicit QL algorithm (Wilkinson, 1965).
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Diagnostics 


The following diagnostics are available. 


Maximum Rank (may be issued as a warning when exceeded) 


The maximum rank $p_{max}$ indicates the maximum number of dimensions that can be computed
for any dataset. In general


\[ p_{max} = \min\left\{ (n - 1), \; \left[ \sum_{j \in J} k_j + m_2 \right] - \max(m_1, \max(0, 1 - m_2)) \right\} \]


where $m_1$ is the number of multiple variables with no missing values, $m_2$ is the number of single
variables, and J is an index set recording which variables are multiple. Although the number of
nontrivial dimensions may be less than $p_{max}$ when m = 2, PRINCALS does allow dimensionalities
all the way up to $p_{max}$. When, due to empty categories in the actual data, the rank deteriorates
below the specified dimensionality, the program stops.

Marginal Frequencies 
The frequencies table gives the univariate marginals and the number of missing values (that is, 
values that are regarded as out of range for the current analysis) for each variable. These are 
computed as the column sums of $D_j$ and the total sum of $M_j$.

Fit and Loss Measures 
When the HISTORY option is in effect, the following fit and loss measures are reported:
Total fit. This is the quantity TFIT defined in (5).
Total loss. This is $\sigma(X; \underline{Y})$, computed as the sum of multiple loss and single loss defined below.
Multiple loss. This measure is computed as

\[ TMLOSS = p - \frac{1}{m} \sum_j \mathrm{tr}[Y_j' D_j Y_j] \]


Single loss. This measure is computed only when some of the variables are single:


\[ SLOSS = \frac{1}{m} \left\{ \sum_{j \notin J} \mathrm{tr}[Y_j' D_j Y_j] - \sum_{j \notin J} a_j' a_j \right\} \]


Eigenvalues and Correlations between Optimally Scaled Variables 


If there are no missing data, the eigenvalues printed by PRINCALS are those of $\frac{1}{m} R(Q)$, where
R(Q) denotes the matrix of correlations between the optimally scaled variables in the columns
of Q. For multiple variables, $q_j$ is defined here as $G_j y_{(j)s}$. When all variables are single or
when p = 1, R(Q) itself is also printed. If there are missing data, then the eigenvalues are those
of the matrix with elements $q_j' M_*^{-1} q_l$, which is not necessarily a correlation matrix, although
it is positive semidefinite.
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PROBIT Algorithms 


The Probit procedure is used to estimate the effects of one or more independent variables on a
dichotomous dependent variable. The program is designed for dose-response analyses and related 
models, but Probit can also estimate logistic regression models. 


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


$m$             Number of covariate patterns

$n_i$           Number of subjects for ith covariate pattern

$r_i$           Number of responses for ith covariate pattern

$p$             Number of independent variables

$q$             Number of levels of the grouping variable. q = 0 when there is no grouping
                variable

$c$             Natural response rate

$X$             n × (p + q) matrix with element $x_{ij}$, which represents the jth covariate
                for the ith covariate pattern

$\gamma$        p×1 vector with element $\gamma_j$, which represents the slope parameter of the jth
                independent variable

$\alpha$        q×1 vector with element $\alpha_j$, which represents the parameter for the jth level
                of the grouping variable

$\beta$         (p + q) × 1 vector which is a composite of $\gamma$ and $\alpha$

$s$             Total number of parameters in the model, equal to p+q if the natural
                response rate is set to a constant, p+q+1 if the natural response rate is to
                be estimated by the model


Model 
The model assumes a dichotomous dependent variable with probability P for the event of interest. 


Since the procedure assumes aggregated data for every covariate pattern, the random variable
$y_i$ takes a binomial distribution.


\[ P(y_i = r_i) = \binom{n_i}{r_i} P_i^{r_i} (1 - P_i)^{n_i - r_i} \]


Hence, the log likelihood, L, for m observations after ignoring the constant factor can be written as


\[ L = \sum_{i=1}^{m} \left[ r_i \ln(P_i) + (n_i - r_i) \ln(1 - P_i) \right] \]


For dose-response models, it is further assumed that


\[ P_i = c + (1 - c) F(x_i'\beta) \]

where $x_i$ is the vector of covariates for the ith covariate pattern and F(x) has two forms:


\[ F(x) = \begin{cases} \dfrac{1}{1 + e^{-x}} & \text{if logit model} \\ \displaystyle\int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{-z^2/2} \, dz & \text{if probit model} \end{cases} \]
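

A minimal sketch of the model probability and the log likelihood for aggregated data, using only the Python standard library. The function names and the tuple layout (covariates, subjects, responses) are hypothetical. The sketch assumes every $P_i$ lies strictly between 0 and 1, since the logarithm fails otherwise.

```python
from math import exp, log
from statistics import NormalDist


def response_probability(eta, c=0.0, model="probit"):
    """P = c + (1 - c) F(eta) for the probit or logit form of F."""
    if model == "probit":
        f = NormalDist().cdf(eta)
    elif model == "logit":
        f = 1.0 / (1.0 + exp(-eta))
    else:
        raise ValueError("model must be 'probit' or 'logit'")
    return c + (1.0 - c) * f


def log_likelihood(patterns, beta, c=0.0, model="probit"):
    """Sum of r*ln(P) + (n - r)*ln(1 - P) over covariate patterns (x, n, r)."""
    total = 0.0
    for x, n, r in patterns:
        eta = sum(xj * bj for xj, bj in zip(x, beta))  # linear predictor x'beta
        p = response_probability(eta, c, model)
        total += r * log(p) + (n - r) * log(1.0 - p)
    return total


data = [((0.5,), 20, 6), ((1.0,), 20, 11), ((1.5,), 20, 15)]
print(log_likelihood(data, (0.3,), c=0.05))
```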


When there is no grouping variable, $x_{ij}$ is simply the observed value of the jth independent
variable for the ith covariate pattern, and $\beta = \gamma$. When there is a grouping variable, a set of indicator
variables is constructed. There will be q indicator variables $I_{i1}, \ldots, I_{iq}$ added to the X matrix and q
parameters $\alpha_1, \ldots, \alpha_q$ added to the $\beta$ vector.


\[ I_{ij} = \begin{cases} 1 & \text{if the ith covariate pattern is in the jth level} \\ 0 & \text{otherwise} \end{cases} \]


Hence, the $x_i$ vector has p+q elements and the associated parameter vector $\beta$ is expanded to


$(\beta_1, \ldots, \beta_p, \beta_{p+1}, \ldots, \beta_{p+q})$, where $\alpha_j = \beta_{p+j}$.


Maximum-Likelihood Estimates (MLE) 


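To obtain the maximum likelihood estimates for c and $\beta_1, \ldots, \beta_{p+q}$, set the following equations
equal to 0:

\[ L'_c = \sum_{i=1}^{m} \frac{r_i - n_i P_i}{P_i (1 - P_i)} \left[ 1 - F(x_i'\beta) \right] \]

\[ L'_j = \begin{cases} (1 - c) \displaystyle\sum_{i=1}^{m} \frac{r_i - n_i P_i}{P_i (1 - P_i)} \, x_{ij} \, F(x_i'\beta) \left( 1 - F(x_i'\beta) \right) & \text{if logit model} \\ (1 - c) \displaystyle\sum_{i=1}^{m} \frac{r_i - n_i P_i}{P_i (1 - P_i)} \, x_{ij} \, \frac{1}{\sqrt{2\pi}} \exp\left\{ -\tfrac{1}{2} (x_i'\beta)^2 \right\} & \text{if probit model} \end{cases} \]


where $L'_c$ is the derivative of L with respect to c and $L'_j$ is the derivative of L with respect to $\beta_j$.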


Algorithm 


Probit uses the algorithms proposed and implemented in NPSOL by Gill, Murray, Saunders, 
and Wright. The loss function for this procedure is the negative of the log-likelihood described 
in the model. The derivatives for the parameters are described above. The only bound for the 
parameters is 0 < c < 1. For more details of the NPSOL algorithms, see CNLR (constrained 
nonlinear regression). 


Natural Response Rate 


When the user specifies a fixed number for the natural response rate, the derivative of L with
respect to c is set to 0 for iterations and the bound for c is set equal to the fixed number.
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Initial Values 
The initial value for each $\beta_j$ is set to 0. If there is a control group, the initial value of c, designated
by $c_0$, is set to the ratio of the response to the number of subjects for the control group. If there is
no control group, then $c_0$ is set to the minimum ratio of the response to the number of subjects,
over all covariate patterns.


Criteria 
Users can control two criteria, ITER and CONV. ITER is the maximum number of iterations 


allowed. The default value is max (50,3(s + 1)). CONV (criterion of convergence) is the same 
as the OPTOLERANCE criterion in CNLR. 


Asymptotic Covariance Matrix 


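The asymptotic covariance matrix for the MLE $(\hat{c}, \hat{\beta}_1, \ldots, \hat{\beta}_{p+q})$ is estimated by $I^{-1}$, where I is
the information matrix containing the negatives of the second partial derivatives of L.


\[ -\frac{\partial^2 L}{\partial c^2} = \sum_{i=1}^{m} \frac{r_i - 2 r_i P_i + n_i P_i^2}{P_i^2 (1 - P_i)^2} \left( 1 - F(x_i'\beta) \right)^2 \]


\[ -\frac{\partial^2 L}{\partial c \, \partial \beta_j} = \sum_{i=1}^{m} x_{ij} \left\{ \frac{r_i - 2 r_i P_i + n_i P_i^2}{P_i^2 (1 - P_i)^2} (1 - c) \left( 1 - F(x_i'\beta) \right) + \frac{r_i - n_i P_i}{P_i (1 - P_i)} \right\} \frac{\partial F(x_i'\beta)}{\partial x_i'\beta} \]

where

\[ \frac{\partial F(x_i'\beta)}{\partial x_i'\beta} = \begin{cases} F(x_i'\beta) \left( 1 - F(x_i'\beta) \right) & \text{if logit model} \\ \dfrac{1}{\sqrt{2\pi}} \exp\left\{ -\tfrac{1}{2} (x_i'\beta)^2 \right\} & \text{if probit model} \end{cases} \]

\[ -\frac{\partial^2 L}{\partial \beta_j \, \partial \beta_k} = (1 - c)^2 \sum_{i=1}^{m} x_{ij} x_{ik} \frac{r_i - 2 r_i P_i + n_i P_i^2}{P_i^2 (1 - P_i)^2} \left( \frac{\partial F(x_i'\beta)}{\partial x_i'\beta} \right)^2 - (1 - c) \sum_{i=1}^{m} x_{ij} x_{ik} \frac{r_i - n_i P_i}{P_i (1 - P_i)} \frac{\partial^2 F(x_i'\beta)}{\partial (x_i'\beta)^2} \]

where

\[ \frac{\partial^2 F(x_i'\beta)}{\partial (x_i'\beta)^2} = \begin{cases} F(x_i'\beta) \left( 1 - F(x_i'\beta) \right) \left( 1 - 2 F(x_i'\beta) \right) & \text{if logit model} \\ -\dfrac{1}{\sqrt{2\pi}} (x_i'\beta) \exp\left( -\tfrac{1}{2} (x_i'\beta)^2 \right) & \text{if probit model} \end{cases} \]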


Frequency Table and Goodness of Fit 


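For every covariate pattern i, i = 1, ..., m, compute


\[ \hat{P}_i = \hat{c} + (1 - \hat{c}) F(x_i'\hat{\beta}), \qquad F(x) = \begin{cases} \dfrac{1}{1 + e^{-x}} & \text{if logit model} \\ \displaystyle\int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{-z^2/2} \, dz & \text{if probit model} \end{cases} \]


Then the expected frequency is equal to

\[ E_i = n_i \hat{P}_i \]


The Pearson chi-square statistic is defined by


\[ \chi^2 = \sum_{i=1}^{m} \frac{(r_i - n_i \hat{P}_i)^2}{n_i \hat{P}_i (1 - \hat{P}_i)} \]
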
and the degrees of freedom (df) is 


» f(q-l)ym-s ifg>2 
p= {ant ifq=1 


Fiducial Limits, RMP, and Parallelism 


The parallelism test statistic, fiducial limits, and relative median potency are available when
there is only one covariate (predictor variable). Assuming that $\hat{\alpha}_1, \ldots, \hat{\alpha}_q$ are the MLE's for
$\alpha_1, \ldots, \alpha_q$ and $\hat{\gamma}$ is the MLE for $\gamma$, $v(\hat{\alpha}_j)$ is the asymptotic variance for $\hat{\alpha}_j$, $v(\hat{\gamma})$ is the asymptotic
variance for $\hat{\gamma}$, and $cov(\hat{\alpha}_j, \hat{\gamma})$ is the asymptotic covariance for $\hat{\alpha}_j$ and $\hat{\gamma}$.


Fiducial Limits for Effective dose x 


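For level of the grouping variable j and P = 0.01 through 0.09, 0.10 through 0.90 (by 0.05), and
0.91 through 0.99, compute


\[ \eta = \begin{cases} \ln(P / (1 - P)) & \text{if logit model} \\ \mathrm{probit}(P) & \text{if probit model} \end{cases} \]


Then the effective dose $x_j$ to obtain probability P of response for level j is defined by

\[ x_j = (\eta - \hat{\alpha}_j) / \hat{\gamma} \]

and the 95% fiducial limit for effective dose $x_j$ is computed by


\[ x_j + \frac{g}{1 - g} \left( x_j + \frac{cov(\hat{\alpha}_j, \hat{\gamma})}{v(\hat{\gamma})} \right) \pm \frac{t}{\hat{\gamma}(1 - g)} \left\{ v(\hat{\alpha}_j) + 2 x_j \, cov(\hat{\alpha}_j, \hat{\gamma}) + x_j^2 \, v(\hat{\gamma}) - g \left( v(\hat{\alpha}_j) - \frac{cov(\hat{\alpha}_j, \hat{\gamma})^2}{v(\hat{\gamma})} \right) \right\}^{1/2} \]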
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“without heterogeneity factor 
g= 2444 
“hr with heterogeneity factor 
= 1.96 without heterogeneity factor 
~ | tio.025,ap) With heterogeneity factor 


h=«7/(df) 


. 1 without heterogeneity factor 
AS : : 
h_ with heterogeneity factor 


The heterogeneity factor is used if the Pearson chi-square statistic is significant. 


Note: If the covariate (predictor variable) x is transformed, transform it back to the original
metrics for the estimate and its two limits. For example, if $\log_{10}$ is applied to the predictor for the
analysis and $x_j^L$, $x_j$, $x_j^U$ are the lower limit, the estimate, and the upper limit on the $\log_{10}$ scale,
then $10^{x_j^L}$ and $10^{x_j^U}$ are the lower and upper limits on the original scale.


Relative Median Potency 


The relative median potency is available when there is a factor variable and a single covariate. It 
is not available if there is no factor variable or if there is more than one covariate. 


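The estimate of relative median potency for group j versus group k is

\[ M_{jk} = (\hat{\alpha}_k - \hat{\alpha}_j) / \hat{\gamma} \]


and its 95% confidence limit is


\[ M_{jk} + \frac{g}{1 - g} \left( M_{jk} + \frac{v_{12}}{v_{22}} \right) \pm \frac{t}{\hat{\gamma}(1 - g)} \left\{ v_{11} + 2 M_{jk} v_{12} + M_{jk}^2 v_{22} - g \left( v_{11} - \frac{v_{12}^2}{v_{22}} \right) \right\}^{1/2} \]


where

\[ v_{11} = v(\hat{\alpha}_j) + v(\hat{\alpha}_k) - 2 \, cov(\hat{\alpha}_j, \hat{\alpha}_k) \]
\[ v_{12} = cov(\hat{\alpha}_j, \hat{\gamma}) - cov(\hat{\alpha}_k, \hat{\gamma}) \]
\[ v_{22} = v(\hat{\gamma}) \]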


Note: If the covariate (predictor variable) x is transformed, transform it back to the original 
metrics for the relative median potency. 


Parallelism Test Chi-Square Statistic 


The parallelism test is available only if there is a factor variable.

\[ \chi^2 = \chi_c^2 - \sum_{j=1}^{q} \chi_j^2 \]


where $\chi_c^2$ is the Pearson chi-square statistic, assuming that the group variable is in the model, and
$\chi_j^2$ is the Pearson chi-square for the jth group. The degrees of freedom for $\chi^2$ is q-1.
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PROXIMITIES Algorithms 


PROXIMITIES computes a variety of measures of similarity, dissimilarity, or distance between 
pairs of cases or pairs of variables. 


Standardizing Cases or Variables 


Either cases or variables can be standardized. The following methods of standardization are 
available: 


Z


PROXIMITIES subtracts the mean from each value for the variable or case being standardized and
then divides by the standard deviation of the values. If a standard deviation is 0, PROXIMITIES 
sets all values for the case or variable to 0. 


RANGE 


PROXIMITIES divides each value for the variable or case being standardized by the range of the 
values. If the range is 0, PROXIMITIES leaves all values unchanged. 


RESCALE 


From each value for the variable or case being standardized, PROXIMITIES subtracts the 
minimum value and then divides by the range. If a range is 0, PROXIMITIES sets all values for 
the case or variable to 0.50. 


MAX 


PROXIMITIES divides each value for the variable or case being standardized by the maximum 
of the values. If the maximum of a set of values is 0, PROXIMITIES uses an alternate process 
to produce a comparable standardization: it divides by the absolute magnitude of the smallest 
value and adds 1. 


MEAN 
PROXIMITIES divides each value for the variable or case being standardized by the mean of 


the values. If a mean is 0, PROXIMITIES adds one to all values for the case or variable to 
produce a mean of 1. 


SD 


PROXIMITIES divides each value for the variable or case being standardized by the standard 
deviation of the values. PROXIMITIES does not change the values if their standard deviation is 0. 
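

The standardization rules, including the special cases for zero divisors, can be sketched as below. Assumptions of this sketch: the sample standard deviation (n - 1 denominator) is used, at least two values are supplied, the label "Z" stands for the mean and standard deviation method, and an all-zero vector under MAX is returned unchanged because the text does not cover that case. The function name standardize is hypothetical.

```python
from statistics import mean, stdev


def standardize(values, method):
    """Standardize one case or variable following the rules described above."""
    lo, hi = min(values), max(values)
    if method in ("Z", "SD"):
        sd = stdev(values)  # assumption: sample standard deviation
        if sd == 0:
            # Z sets everything to 0; SD leaves the values unchanged
            return [0.0] * len(values) if method == "Z" else list(values)
        center = mean(values) if method == "Z" else 0.0
        return [(v - center) / sd for v in values]
    if method == "RANGE":
        return list(values) if hi == lo else [v / (hi - lo) for v in values]
    if method == "RESCALE":
        if hi == lo:
            return [0.5] * len(values)
        return [(v - lo) / (hi - lo) for v in values]
    if method == "MAX":
        if hi != 0:
            return [v / hi for v in values]
        if lo == 0:
            return list(values)  # assumption: all zeros, nothing to scale
        return [v / abs(lo) + 1.0 for v in values]
    if method == "MEAN":
        m = mean(values)
        return [v + 1.0 for v in values] if m == 0 else [v / m for v in values]
    raise ValueError("unknown method: " + method)


print(standardize([1.0, 2.0, 3.0], "RESCALE"))
```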
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Transformations 


Three transformations are available for the values PROXIMITIES computes or reads: 


ABSOLUTE 


Take the absolute values of the proximities. 


REVERSE 


Transform similarity values into dissimilarities, or vice versa, by changing the signs of the 
coefficients. 


RESCALE 


RESCALE standardizes the proximities by first subtracting the value of the smallest and then 
dividing by the range. 


If you specify more than one transformation, PROXIMITIES does them in the order listed above: 
first ABSOLUTE, then REVERSE, then RESCALE. 
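

The fixed order of the transformations can be sketched as follows. The function name and keyword arguments are hypothetical, and leaving the values unchanged when the range is 0 is an assumption of this sketch, since the text does not describe that case.

```python
def transform_proximities(values, absolute=False, reverse=False, rescale=False):
    """Apply the requested transformations in the order ABSOLUTE, REVERSE, RESCALE."""
    result = list(values)
    if absolute:
        result = [abs(v) for v in result]
    if reverse:
        result = [-v for v in result]  # change the sign of each coefficient
    if rescale:
        lo, hi = min(result), max(result)
        if hi != lo:  # assumption: skip rescaling when the range is 0
            result = [(v - lo) / (hi - lo) for v in result]
    return result


print(transform_proximities([-3.0, 1.0, 2.0], absolute=True, reverse=True, rescale=True))
```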


Proximities Measures 


Measure defines the formula for calculating distance. For example, the Euclidean distance 
measure calculates the distance as a “straight line” between two clusters. 


Measures for Continuous Data 


Measures for continuous data, also called interval measures, assume that the variables are scale. 


EUCLID 


The distance between two items, x and y, is the square root of the sum of the squared differences 
between the values for the items. 


\[ \mathrm{EUCLID}(x, y) = \sqrt{\sum_i (x_i - y_i)^2} \]


SEUCLID 


The distance between two items is the sum of the squared differences between the values for 
the items. 


\[ \mathrm{SEUCLID}(x, y) = \sum_i (x_i - y_i)^2 \]
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CORRELATION 


This is a pattern similarity measure. 


\[ \mathrm{CORRELATION}(x, y) = \frac{\sum_i (Z_{xi} Z_{yi})}{N - 1} \]


where $Z_{xi}$ is the (standardized) Z-score value of x for the ith case or variable, and N is the number
of cases or variables.


COSINE 


This is a pattern similarity measure. 


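\[ \mathrm{COSINE}(x, y) = \frac{\sum_i (x_i y_i)}{\sqrt{\left( \sum_i x_i^2 \right) \left( \sum_i y_i^2 \right)}} \]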


CHEBYCHEV 


The distance between two items is the maximum absolute difference between the values for 
the items. 


\[ \mathrm{CHEBYCHEV}(x, y) = \max_i |x_i - y_i| \]


BLOCK 


The distance between two items is the sum of the absolute differences between the values for 
the items. 


\[ \mathrm{BLOCK}(x, y) = \sum_i |x_i - y_i| \]


MINKOWSKI(p) 


The distance between two items is the pth root of the sum of the absolute differences to the 
pth power between the values for the items. 


\[ \mathrm{MINKOWSKI}(x, y) = \left( \sum_i |x_i - y_i|^p \right)^{1/p} \]


POWER(p,r) 


The distance between two items is the rth root of the sum of the absolute differences to the pth 
power between the values for the items. 


\[ \mathrm{POWER}(x, y) = \left( \sum_i |x_i - y_i|^p \right)^{1/r} \]
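

Several of the continuous measures are special cases of POWER(p, r), which the following sketch makes explicit. The function names are hypothetical, and the two items are assumed to be equal-length sequences of numbers with no missing values.

```python
def power_distance(x, y, p, r):
    """rth root of the sum of absolute differences raised to the pth power."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / r)


def minkowski(x, y, p):
    return power_distance(x, y, p, p)


def euclid(x, y):
    return power_distance(x, y, 2, 2)


def seuclid(x, y):
    return power_distance(x, y, 2, 1)


def block(x, y):
    return power_distance(x, y, 1, 1)


def chebychev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))


x, y = [0.0, 0.0], [3.0, 4.0]
print(euclid(x, y), seuclid(x, y), block(x, y), chebychev(x, y))
```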
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Measures for Frequency Count Data 


Frequency count measures assume that the variables are discrete numeric. 


CHISQ


The magnitude of this dissimilarity measure depends on the total frequencies of the two cases or
variables whose proximity is computed. Expected values are from the model of independence of
cases (or variables), x and y.


\[ \mathrm{CHISQ}(x, y) = \sqrt{ \sum_i \frac{(x_i - E(x_i))^2}{E(x_i)} + \sum_i \frac{(y_i - E(y_i))^2}{E(y_i)} } \]


PH2


This is the CHISQ measure normalized by the square root of the combined frequency. Therefore,
its value does not depend on the total frequencies of the two cases or variables whose proximity is
computed.


\[ \mathrm{PH2}(x, y) = \frac{\mathrm{CHISQ}(x, y)}{\sqrt{N}} \]

where N is the combined frequency.


Measures for Binary Data 


Binary measures assume that the variables take only two values. 


PROXIMITIES constructs a 2 x 2 contingency table for each pair of items in turn. It uses this
table to compute a proximity measure for the pair.
Table 81-1 
2 x 2 Contingency table 
Item 2 Present Item 2 Absent 
Item 1 Present a b 
Item 1 Absent c d 


PROXIMITIES computes all binary measures from the values of a, b, c, and d. These values 
are tallies across variables (when the items are cases) or tallies across cases (when the items 
are variables). 
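

A minimal sketch of how the four tallies can be accumulated for a pair of items. The function name contingency_counts and the present argument (the value treated as "present") are hypothetical.

```python
def contingency_counts(item1, item2, present=1):
    """Tally (a, b, c, d) for two equal-length binary sequences."""
    a = b = c = d = 0
    for u, v in zip(item1, item2):
        first, second = (u == present), (v == present)
        if first and second:
            a += 1      # present in both items
        elif first:
            b += 1      # present in item 1 only
        elif second:
            c += 1      # present in item 2 only
        else:
            d += 1      # absent from both items
    return a, b, c, d


print(contingency_counts([1, 1, 0, 0, 1], [1, 0, 1, 0, 1]))
```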


Russel and Rao Similarity Measure 


This is the binary dot product.


\[ \mathrm{RR}(x, y) = \frac{a}{a + b + c + d} \]
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Simple Matching Similarity Measure 


This is the ratio of the number of matches to the total number of characteristics.


\[ \mathrm{SM}(x, y) = \frac{a + d}{a + b + c + d} \]


Jaccard Similarity Measure 


This is also known as the similarity ratio. 


\[ \mathrm{JACCARD}(x, y) = \frac{a}{a + b + c} \]


Dice or Czekanowski or Sorenson Similarity Measure 


\[ \mathrm{DICE}(x, y) = \frac{2a}{2a + b + c} \]


Sokal and Sneath Similarity Measure 1 


\[ \mathrm{SS1}(x, y) = \frac{2(a + d)}{2(a + d) + b + c} \]


Rogers and Tanimoto Similarity Measure 


\[ \mathrm{RT}(x, y) = \frac{a + d}{a + d + 2(b + c)} \]


Sokal and Sneath Similarity Measure 2 


\[ \mathrm{SS2}(x, y) = \frac{a}{a + 2(b + c)} \]


Kulczynski Similarity Measure 1 


This measure has a minimum value of 0 and no upper limit. It is undefined when there are no
nonmatches (b = 0 and c = 0). Therefore, PROXIMITIES assigns an artificial upper limit of
9999.999 to K1 when it is undefined or exceeds this value.


\[ \mathrm{K1}(x, y) = \frac{a}{b + c} \]
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Sokal and Sneath Similarity Measure 3 


This measure has a minimum value of 0, has no upper limit, and is undefined when there are
no nonmatches (b = 0 and c = 0). As with K1, PROXIMITIES assigns an artificial upper limit
of 9999.999 to SS3 when it is undefined or exceeds this value.


\[ \mathrm{SS3}(x, y) = \frac{a + d}{b + c} \]
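

The artificial upper limit for K1 and SS3 can be sketched as follows. The function names and the constant name UPPER_LIMIT are hypothetical; only the value 9999.999 comes from the text.

```python
UPPER_LIMIT = 9999.999  # artificial upper limit stated in the text


def kulczynski1(a, b, c):
    """K1 = a / (b + c), capped when undefined or too large."""
    if b + c == 0:
        return UPPER_LIMIT
    return min(a / (b + c), UPPER_LIMIT)


def sokal_sneath3(a, b, c, d):
    """SS3 = (a + d) / (b + c), capped when undefined or too large."""
    if b + c == 0:
        return UPPER_LIMIT
    return min((a + d) / (b + c), UPPER_LIMIT)


print(kulczynski1(3, 1, 2), kulczynski1(3, 0, 0), sokal_sneath3(2, 1, 1, 2))
```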


Conditional Probabilities 


The following three binary measures yield values that you can interpret in terms of conditional 
probability. All three are similarity measures. 


Kulczynski Similarity Measure 2 


This yields the average conditional probability that a characteristic is present in one item given
that the characteristic is present in the other item. The measure is an average over both items
acting as predictors. It has a range of 0 to 1.


\[ \mathrm{K2}(x, y) = \frac{a/(a + b) + a/(a + c)}{2} \]


Sokal and Sneath Similarity Measure 4 


This yields the conditional probability that a characteristic of one item is in the same state (present 
or absent) as the characteristic of the other item. The measure is an average over both items 
acting as predictors. It has a range of 0 to 1. 


\[ \mathrm{SS4}(x, y) = \frac{a/(a + b) + a/(a + c) + d/(b + d) + d/(c + d)}{4} \]


Hamann Similarity Measure 


This measure gives the probability that a characteristic has the same state in both items (present
in both or absent from both) minus the probability that a characteristic has different states in the
two items (present in one and absent from the other). HAMANN has a range of -1 to +1 and is
monotonically related to SM, SS1, and RT.


\[ \mathrm{HAMANN}(x, y) = \frac{(a + d) - (b + c)}{a + b + c + d} \]


Predictability Measures 


The following four binary measures assess the association between items as the predictability of 
one given the other. All four measures yield similarities. 
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Goodman and Kruskal Lambda (Similarity) 


This coefficient assesses the predictability of the state of a characteristic on one item (presence 
or absence) given the state on the other item. Specifically, lambda measures the proportional 


reduction in error using one item to predict the other, when the directions of prediction are of equal 
importance. Lambda has a range of 0 to 1. 


\[ t_1 = \max(a, b) + \max(c, d) + \max(a, c) + \max(b, d) \]
\[ t_2 = \max(a + c, b + d) + \max(a + b, c + d) \]
\[ \mathrm{LAMBDA}(x, y) = \frac{t_1 - t_2}{2(a + b + c + d) - t_2} \]


Anderberg’s D (Similarity) 


This coefficient assesses the predictability of the state of a characteristic on one item (presence
or absence) given the state on the other. D measures the actual reduction in the error probability
when one item is used to predict the other. The range of D is 0 to 1.


\[ t_1 = \max(a, b) + \max(c, d) + \max(a, c) + \max(b, d) \]
\[ t_2 = \max(a + c, b + d) + \max(a + b, c + d) \]


\[ D(x, y) = \frac{t_1 - t_2}{2(a + b + c + d)} \]
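

Lambda and Anderberg's D share the quantities $t_1$ and $t_2$ and differ only in the denominator, as the following sketch shows. The function names are hypothetical. The sketch does not handle tables in which the lambda denominator is 0 (for example, when every characteristic falls in a single cell); the text does not say how that case is treated.

```python
def _t1_t2(a, b, c, d):
    """Shared building blocks of LAMBDA and D."""
    t1 = max(a, b) + max(c, d) + max(a, c) + max(b, d)
    t2 = max(a + c, b + d) + max(a + b, c + d)
    return t1, t2


def goodman_kruskal_lambda(a, b, c, d):
    t1, t2 = _t1_t2(a, b, c, d)
    # Raises ZeroDivisionError when 2(a+b+c+d) equals t2 (case not covered here)
    return (t1 - t2) / (2 * (a + b + c + d) - t2)


def anderberg_d(a, b, c, d):
    t1, t2 = _t1_t2(a, b, c, d)
    return (t1 - t2) / (2 * (a + b + c + d))


print(goodman_kruskal_lambda(4, 0, 0, 4), anderberg_d(4, 0, 0, 4))
```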


Yule’s Y Coefficient of Colligation (Similarity) 


This is a function of the cross-product ratio for a 2 x 2 table. It has a range of —1 to +1. 


\[ Y(x, y) = \frac{\sqrt{ad} - \sqrt{bc}}{\sqrt{ad} + \sqrt{bc}} \]


Yule’s Q (Similarity) 


This is the 2 x 2 version of Goodman and Kruskal’s ordinal measure gamma. Like Yule’s Y, Q is a
function of the cross-product ratio for a 2 x 2 table and has a range of -1 to +1.


\[ Q(x, y) = \frac{ad - bc}{ad + bc} \]


Other Binary Measures 


The remaining binary measures available in PROXIMITIES are either binary equivalents of 
association measures for continuous variables or measures of special properties of the relation 
between items. 


Ochiai Similarity Measure 


This is the binary form of the cosine. It has a range of 0 to 1 and is a similarity measure. 


Sokal and Sneath Similarity Measure 5 


This is a similarity measure. Its range is 0 to 1. 


$$\mathrm{SS5}(x,y) = \frac{ad}{\sqrt{(a+b)(a+c)(b+d)(c+d)}}$$


Fourfold Point Correlation (Similarity) 


This is the binary form of the Pearson product-moment correlation coefficient. Phi is a similarity 
measure, and its range is 0 to 1. 


$$\mathrm{PHI}(x,y) = \frac{ad - bc}{\sqrt{(a+b)(a+c)(b+d)(c+d)}}$$


Binary Euclidean Distance 


This is a distance measure. Its minimum value is 0, and it has no upper limit. 


$$\mathrm{BEUCLID}(x,y) = \sqrt{b+c}$$


Binary Squared Euclidean Distance 
This is also a distance measure. Its minimum value is 0, and it has no upper limit. 


$$\mathrm{BSEUCLID}(x,y) = b + c$$


Size Difference 
This is a dissimilarity measure with a minimum value of 0 and no upper limit. 


$$\mathrm{SIZE}(x,y) = \frac{(b-c)^2}{(a+b+c+d)^2}$$


Pattern Difference 


This is also a dissimilarity measure. Its range is 0 to 1. 


$$\mathrm{PATTERN}(x,y) = \frac{bc}{(a+b+c+d)^2}$$


Binary Shape Difference 


This dissimilarity measure has no upper or lower limit. 


$$\mathrm{BSHAPE}(x,y) = \frac{(a+b+c+d)(b+c) - (b-c)^2}{(a+b+c+d)^2}$$


Dispersion Similarity Measure 


This similarity measure has a range of —1 to +1. 


$$\mathrm{DISPER}(x,y) = \frac{ad - bc}{(a+b+c+d)^2}$$


Variance Dissimilarity Measure 


This dissimilarity measure has a minimum value of 0 and no upper limit. 


$$\mathrm{VARIANCE}(x,y) = \frac{b+c}{4(a+b+c+d)}$$
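As a quick cross-check of the distance-type measures in this group, the sketch below evaluates them for one 2 × 2 table. The function name `other_binary_measures` is hypothetical, and the cell counts are assumed to follow the usual convention (a = present in both items, d = absent from both, b and c = present in exactly one).

```python
from math import sqrt

def other_binary_measures(a, b, c, d):
    """Distance-type binary measures for a 2 x 2 table (illustrative sketch)."""
    n = a + b + c + d
    return {
        "BEUCLID": sqrt(b + c),
        "BSEUCLID": b + c,
        "SIZE": (b - c) ** 2 / n ** 2,
        "PATTERN": b * c / n ** 2,
        "BSHAPE": (n * (b + c) - (b - c) ** 2) / n ** 2,
        "DISPER": (a * d - b * c) / n ** 2,
        "VARIANCE": (b + c) / (4 * n),
    }

measures = other_binary_measures(3, 2, 1, 4)
```

All of these except DISPER depend only on the mismatch counts b and c and the table total.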


Binary Lance-and-Williams Nonmetric Dissimilarity Measure 


Also known as the Bray-Curtis nonmetric coefficient, this dissimilarity measure has a range 
of 0 to 1. 


$$\mathrm{BLWMN}(x,y) = \frac{b+c}{2a+b+c}$$


PROXSCAL Algorithms 


PROXSCAL performs multidimensional scaling of proximity data to find a least-squares 
representation of the objects in a low-dimensional space. Individual differences models can be 
specified for multiple sources. A majorization algorithm guarantees monotone convergence for 
optionally transformed, metric and nonmetric data under a variety of models and constraints. 

Detailed mathematical derivations concerning the algorithm can be found in Commandeur and 
Heiser (1993). 


Notation 


The following notation is used throughout this chapter unless otherwise stated. For the dimensions 
of the vectors and matrices: 


n    Number of objects 
m    Number of sources 
p    Number of dimensions 
s    Number of independent variables 
h    Maximum(s, p) 
l    Length of transformation vector 
r    Degree of spline 
t    Number of interior knots for spline 


The input and input-related variables are: 


$\Delta_k$    n×n matrix with raw proximities for source k 
$W_k$         n×n matrix with weights for source k 
E             n×s matrix with raw independent variables 
F             n×p matrix with fixed coordinates 


Output and output-related variables are: 


$\hat{D}_k$   n×n matrix with transformed proximities for source k 
Z             n×p matrix with common space coordinates 
$A_k$         p×p matrix with space weights for source k 
$X_k$         n×p matrix with individual space coordinates for source k 
Q             n×h matrix with transformed independent variables 
B             h×p matrix with regression weights for independent variables 
S             l×(r+t) matrix of coefficients for the spline basis 


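Special matrices and functions are: 


J           $I - 11^T/1^T1$, centering matrix of appropriate size 
$D(X_k)$    n×n matrix with distances, with elements $\{d_{ijk}\}$, where 
            $d_{ijk} = \sqrt{(x_{ik} - x_{jk})^T (x_{ik} - x_{jk})}$ 
$V_k$       n×n matrix with elements $\{v_{ijk}\}$, where 

$$v_{ijk} = \begin{cases} -w_{ijk} & \text{for } i \neq j \\ \sum_{l \neq i} w_{ilk} & \text{for } i = j \end{cases}$$

$B(X_k)$    n×n matrix with elements $\{b_{ijk}\}$, where 

$$b_{ijk} = \begin{cases} -\dfrac{w_{ijk}\,\hat{d}_{ijk}}{d_{ij}(X_k)} & \text{if } d_{ij}(X_k) > 0 \text{ and } i \neq j \\ 0 & \text{if } d_{ij}(X_k) = 0 \text{ and } i \neq j \\ -\sum_{l \neq i} b_{ilk} & \text{if } i = j \end{cases}$$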


Introduction 


The following loss function is minimized by PROXSCAL, 


$$\sigma^2 = \frac{1}{m}\sum_{k=1}^{m}\sum_{i<j}^{n} w_{ijk}\left[\hat{d}_{ijk} - d_{ij}(X_k)\right]^2$$


which is the weighted mean squared error between the transformed proximities and the distances 
of n objects within m sources. The transformation function for the proximities provides 
nonnegative, monotonically nondecreasing values for the transformed proximities $\hat{d}_{ijk}$. The 
distances $d_{ij}(X_k)$ are simply the Euclidean distances between the object points, with the 
coordinates in the rows of $X_k$. 


The main algorithm consists of the following major steps: 
1. find initial configurations $X_k$, and evaluate the loss function; 
2. find an update for the configurations $X_k$; 
3. find an update for the transformed proximities $\hat{d}_{ijk}$; 
4. evaluate the loss function; if some predefined stop criterion is satisfied, stop; otherwise, go to 
step 2. 


Preliminaries 


At the start of the procedure, several preliminary computations are performed to handle missing 
weights or proximities, and initialize the raw proximities. 


Missing Values 


On input, missing values may occur for both weights and proximities. If a weight is missing, it is 
set equal to zero. If a proximity is missing, the corresponding weight is set equal to zero. 


Proximities 


Only the upper or lower triangular part (without the diagonal) of the proximity matrix is needed. 
In case both triangles are given, the weighted mean of both triangles is used. Next, the raw 
proximities are transformed such that similarities become dissimilarities by multiplying with -1, 
taking into account the conditionality, and setting the smallest dissimilarity equal to zero. 


Transformations 


For ordinal transformations, the nonmissing proximities are replaced by their ascending rank 
numbers, also taking into account the conditionality. For spline transformations, the spline basis 
S is computed. 


Normalization 


The proximities are normalized such that the weighted squared proximities equal the sum of the 
weights, again, taking into account the conditionality. 


Step 1: Initial Configuration 


PROXSCAL allows for several initial configurations. Before determining the initial configuration, 
missing values are handled, and the raw proximities are initialized. Finally, after one of the starts 
described below, the common space Z is centered on the origin and optimally dilated in 
accordance with the normalized proximities. 


Simplex Start 


The simplex start consists of a rank-p approximation of the matrix $V^{-}B(J)$. Set H, an n×p 
columnwise orthogonal matrix, satisfying Hl n= I, where I,, denotes the matrix with the first 
p columns of the identity matrix. The nonzero rows are selected in such a way that the first 
Z=B(J)H contains the p columns of B(J) with the largest diagonal elements. The following steps 
are computed in turn, until convergence is reached: 


For a fixed Z, $H = PQ^T$, where $PQ^T$ is taken from the singular value decomposition $B(J)Z = PLQ^T$; 
For a fixed H, $Z = 2^{-1/2}\,V^{-}B(J)H$, where $V^{-}$ is the pseudo-inverse of V. 


For a restricted common space Z, the second step is adjusted in order to fulfill the restrictions. 
This procedure was introduced in Heiser (1985). 


Torgerson Start 


The proximities are aggregated over sources, squared, double centered and multiplied with −0.5, 
after which an eigenvalue decomposition is used to determine the coordinate values, thus 


$$-0.5\,J D^{*2} J = Q \Lambda Q^T$$


where elements of $D^{*}$ are defined as 


$$d^{*}_{ij} = \sum_{k=1}^{m} w_{ijk}\,\hat{d}_{ijk} \left(\sum_{k=1}^{m} w_{ijk}\right)^{-1}$$


followed by $Z = Q\Lambda^{1/2}$, where only the first p positive ordered eigenvalues ($\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_n$) 
and eigenvectors are used. This technique, classical scaling, is due to Torgerson (1952, 1958) and 
Gower (1966) and also known under the names Torgerson scaling or Torgerson-Gower scaling. 


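The Torgerson start can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not PROXSCAL's code: the function name `torgerson_start` is hypothetical, the inputs are assumed to be arrays of shape (m, n, n) holding the proximities and weights per source, pairs with zero total weight are given an aggregated proximity of 0, and negative eigenvalues are clipped to zero before taking the square root.

```python
import numpy as np

def torgerson_start(dhat, w, p):
    """Classical scaling start: returns an n x p coordinate matrix Z."""
    dhat = np.asarray(dhat, dtype=float)
    w = np.asarray(w, dtype=float)
    wsum = w.sum(axis=0)
    # Weighted aggregate of the proximities over sources (assumed 0 where weights sum to 0).
    dstar = np.divide((w * dhat).sum(axis=0), wsum,
                      out=np.zeros_like(wsum), where=wsum > 0)
    n = dstar.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (dstar ** 2) @ J              # squared, double centered, times -0.5
    evals, evecs = np.linalg.eigh(B)             # ascending eigenvalues
    order = np.argsort(evals)[::-1][:p]          # indices of the p largest
    lam = np.clip(evals[order], 0.0, None)
    return evecs[:, order] * np.sqrt(lam)

# Three points on a line; one source with unit weights.
pts = np.array([0.0, 1.0, 3.0])
d = np.abs(pts[:, None] - pts[None, :])
Z = torgerson_start(d[None, :, :], np.ones((1, 3, 3)), p=1)
```

For exact Euclidean input such as this, the recovered configuration reproduces the original distances up to translation and reflection.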
(Multiple) Random Start 

The coordinate values are randomly generated from a uniform distribution using the default 

random number generator from IBM® SPSS® Statistics. 


User-Provided Start 


The coordinate values provided by the user are used. 


Step 2: Configuration Update 


The coordinates of the common space and the space weights (if applicable) are updated. 


Update for the Common Space 


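The common space Z is related to the individual spaces $X_k$ through the model $X_k = ZA_k$, where 
$A_k$ are matrices containing space weights. Assume that weight matrix $A_k$ is of full rank. Only 
considering Z defines the loss function as 


$$\sigma^2(z) = c + z^T H z - 2 z^T t$$


where c is a constant that does not depend on z, and 

$$z = \mathrm{vec}(Z),$$
$$H = \sum_{k=1}^{m}\left(A_k A_k^T \otimes V_k\right),$$
$$t = \mathrm{vec}\left(\sum_{k=1}^{m} B(X_k) X_k A_k^T\right),$$


for which a solution is found as 

$$z = H^{-} t$$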


Several special cases exist for which the solution can be simplified. First, the weights matrices 
W), may all be equal, or even all equal to one. In these cases H will simplify, as will the pseudo- 
inverse of H. Another simplification is concerned with the different models, reflected in 
restrictions for the space weights. This model is the generalized Euclidean model, also known as 
IDIOSCAL (Carroll and Chang, 1972). The weighted Euclidean model, or INDSCAL, restricts 
$A_k$ to be diagonal, which does simplify H, but not the pseudo-inverse. The identity model requires 
$A_k = I$ for all k, and does simplify H and its pseudo-inverse, for the Kronecker product vanishes. 

To avoid computing the pseudo-inverse of a large matrix, PROXSCAL uses three technical 
simplifications when appropriate. First, the pseudo-inverse can be replaced by a proper inverse by 
adding the nullspace, taking the proper inverse and then subtracting the nullspace again as 


$$H^{-} = (H + N)^{-1} - N$$


where $N = 11^T/1^T1$. Furthermore, a dimensionwise approach (Heiser and Stoop, 1986) is 
used which results in a solution for dimension a of Z based on 


$$V_a = \sum_{k=1}^{m} V_k\, e_a^T A_k A_k^T e_a,$$


where $e_a$ is the ath column of an identity matrix, and 


$$z_a = V_a^{-}\sum_{k=1}^{m}\left[B(X_k) X_k A_k^T - V_k P_a A_k A_k^T\right] e_a,$$


with $P_a$ an n×p matrix equal to Z, but with the ath column containing zeros. 

Still, the proper inverse of a nxn matrix is required. The final simplification is concerned with 
a majorization function in which the largest eigenvalue of V allows for an easy update (Heiser, 
1987; Groenen, Heiser, and Meulman, 1999). Instead of the largest eigenvalue itself, an upper 
bound is used for this scalar (Wolkowicz and Styan, 1980). 


Update for the Space Weights 


An update for the space weights $A_k$ (k = 1,...,m) for the generalized Euclidean model is given by 


$$A_k = \left(Z^T V_k Z\right)^{-1}\left(Z^T B(X_k) X_k\right)$$


Suppose $P_k L_k Q_k^T$ is the singular value decomposition of $A_k$, for which the diagonal matrix with 
singular values $L_k$ is in nonincreasing order. Then, for the reduced rank model, the best rank-r (r<p) 
approximation of $A_k$ is given by $R_k T_k^T$, where $R_k$ contains the first r columns of $P_k L_k$, and 
$T_k$ contains the first r columns of $Q_k$. 


For the weighted Euclidean model, the update reduces to a diagonal matrix 


$$A_k = \mathrm{diag}\left(Z^T V_k Z\right)^{-1}\mathrm{diag}\left(Z^T B(X_k) X_k\right)$$


The space weights for the identity model need no update, since $A_k = I$ for all k. Simplifications 
can be obtained if all weights W are equal to one and for the reduced rank model, which can be 
done in r dimensions, as explained in Heiser and Stoop (1986). 


Restrictions 


The user can impose restrictions on the common space by fixing some of the coordinates or 
specifying that the common space is a weighted sum of independent variables. 


Fixed Coordinates 


If some of the coordinates of Z are fixed by the user, then only the free coordinates of Z need to be 
updated. The dimensionwise approach is taken one step further, which results in an update for 
object i on dimension a as 


$$z_{ia}^{+} = \frac{1}{v_{iia}}\, e_i^T\left[\sum_{k=1}^{m} B(X_k) X_k A_k^T e_a - \sum_{j \neq a}^{p}\left(\sum_{k=1}^{m} e_j^T A_k A_k^T e_a V_k\right) z_j - V_a \tilde{z}_a\right]$$


where the ath column of Z is divided into $z_a = \tilde{z}_a + z_{ia} e_i$, with $e_i$ the ith column of the identity 
matrix, $v_{iia}$ the ith diagonal element of $V_a$, and $V_a = \sum_{k=1}^{m} e_a^T A_k A_k^T e_a V_k$. 


This update procedure will only locally minimize the loss function, and repeatedly cycling 
through all free coordinates until convergence is reached will provide global optimization. After 
all free coordinates have been updated, Z is centered on the origin. On output, the configuration is 
adapted so as to coincide with the initial fixed coordinates. 


Independent Variables 




Independent variables Q are used to express the coordinates of the common space Z as a weighted 
sum of these independent variables as 


$$Z = QB = \sum_{j=1}^{h} q_j b_j^T$$


An update for Z is found by performing the following calculations for j=1,...,h: 


$$T_j = C - \sum_{k=1}^{m} V_k U_j A_k A_k^T, \quad \text{where } C = \sum_{k=1}^{m} B(X_k) X_k A_k^T$$

update $b_j$ as 

$$b_j = \left(\sum_{k=1}^{m} q_j^T V_k q_j\, A_k A_k^T\right)^{-1} T_j^T q_j$$

optionally, compute optimally transformed variables by regressing 

$$\tilde{q}_j = \frac{1}{\kappa_j} T_j b_j + \left(I - \frac{1}{\kappa_j} V_j\right) q_j,$$

where $V_j = \sum_{k=1}^{m} b_j^T A_k A_k^T b_j V_k$, and $\kappa_j$ is greater than or equal to the largest eigenvalue of 
$V_j$, on the original variable $q_j$. Missing elements in the original variable are replaced with 
the corresponding values from $\tilde{q}_j$. 


Finally, set $Z = QB = \sum_{j=1}^{h} q_j b_j^T$. 


Independent variables restrictions were introduced for the MDS model in Bentler and Weeks 
(1978), Bloxom (1978), de Leeuw and Heiser (1980) and Meulman and Heiser (1984). If there are 
more dimensions (p) than independent variables (s), p—s dummy variables are created and treated 
completely free in the analysis. The transformations for the independent variables from Step 4 are 
identical to the transformations of the proximities, except that the nonnegativity constraint does 
not apply. After transformation, the variables q are centered on the origin, normalized on n, and 
the reverse normalization is applied to the regression weights b. 


Step 3: Transformation Update 


The values of the transformed proximities are updated. 


Conditionality 


Two types of conditionalities exist in PROXSCAL. Conditionality refers to the possible 
comparison of proximities in the transformation step. For unconditional transformations, 

all proximities are allowed to be compared with each other, irrespective of the source. 
Matrix-conditional transformations only allow for comparison of proximities within one matrix k, 
in PROXSCAL referred to as one source k. Here, the transformation is computed for each source 
separately (thus m times). 


Transformation Functions 


All transformation functions in PROXSCAL result in nonnegative values for the transformed 
proximities. After the transformation, the transformed proximities are normalized and the 
common space is optimally dilated accordingly. The following transformations are available. 


Ratio. $\hat{D} = \Delta$. No transformation is necessary, since the scale of $\hat{D}$ is adjusted in the 
normalization step. 


Interval. $\hat{D} = \alpha + \beta\Delta$. Both $\alpha$ and $\beta$ are computed using linear regression, in such a way that both 
parameters are nonnegative. 


Ordinal. $\hat{D} = \mathrm{WMON}(\Delta, W)$. Weighted monotone regression (WMON) is computed using the 
up-and-down-blocks minimum violators algorithm (Kruskal, 1964; Barlow et al., 1972). For the 
secondary approach to ties, ties are kept tied: the proximities within tieblocks are first contracted 
and expanded afterwards. 


Spline. $\mathrm{vec}(\hat{D}) = Sb$. PROXSCAL uses monotone spline transformations (Ramsay, 1988). 
In this case, the spline transformation gives a smooth nondecreasing piecewise polynomial 
transformation. It is computed as a weighted regression of D on the spline basis S. Regression 
weights b are restricted to be nonnegative and computed using nonnegative alternating least 
squares (Groenen, van Os and Meulman, 2000). 


Normalization 


After transformation, the transformed proximities are normalized such that the sum-of-squares 
of the weighted transformed proximities is equal to mn(n−1)/2 in the unconditional case and 
equal to n(n−1)/2 in the matrix-conditional case. 


Step 4: Termination 


After evaluation of the loss function, the old function value and new function values are used to 
decide whether iterations should continue. If the new function value is smaller than or equal to 
the minimum Stress value MINSTRESS, provided by the user, iterations are terminated. Also, if 
the difference in consecutive Stress values is smaller than or equal to the convergence criterion 
DIFFSTRESS, provided by the user, iterations are terminated. Finally, iterations are terminated if 
the current number of iterations exceeds the maximum number of iterations MAXITER, also 
provided by the user. In all other cases, iterations continue. 
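The three stopping rules above can be sketched as a small predicate. The function name `should_stop` and the default values for `minstress`, `diffstress`, and `maxiter` are illustrative assumptions, not PROXSCAL's documented defaults.

```python
def should_stop(old_stress, new_stress, iteration,
                minstress=1e-4, diffstress=1e-4, maxiter=100):
    """Return True when any of the three termination criteria is met.

    Default thresholds are placeholders chosen for illustration only.
    """
    if new_stress <= minstress:                 # Stress small enough
        return True
    if old_stress - new_stress <= diffstress:   # improvement too small
        return True
    if iteration > maxiter:                     # iteration budget exceeded
        return True
    return False

keep_going = not should_stop(old_stress=0.20, new_stress=0.15, iteration=3)
```

Because a majorization algorithm never increases the loss, the difference `old_stress - new_stress` is nonnegative in exact arithmetic, so the second test acts as a pure convergence check.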


Acceleration 


For the identity model without further restrictions, the common space can be updated with 
acceleration as $Z^{new} = 2Z^{update} - Z^{old}$, also referred to as the relaxed update. 


Lowering Dimensionality 


For a restart in p−1 dimensions, the p−1 most important dimensions need to be identified. For 
the identity model, the first p−1 principal axes are used. For the weighted Euclidean model, the 
p−1 most important space weights are used, and for the generalized Euclidean and reduced rank 
models, the p−1 largest singular values of the space weights determine the remaining dimensions. 




Stress Measures 


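The following statistics are used for the computation of the Stress measures: 


$$\eta^2(\hat{D}) = \sum_{k=1}^{m}\sum_{i<j}^{n} w_{ijk}\,\hat{d}_{ijk}^{\,2}$$

$$\eta^2(X) = \sum_{k=1}^{m}\sum_{i<j}^{n} w_{ijk}\,d_{ij}^{2}(X_k)$$

$$\rho(X) = \sum_{k=1}^{m}\sum_{i<j}^{n} w_{ijk}\,\hat{d}_{ijk}\,d_{ij}(X_k)$$

$$\rho^2(X) = \sum_{k=1}^{m}\sum_{i<j}^{n} w_{ijk}\,\hat{d}_{ijk}^{\,2}\,d_{ij}^{2}(X_k)$$

$$\kappa^2(X) = \sum_{k=1}^{m}\sum_{i<j}^{n} w_{ijk}\left(d_{ij}(X_k) - \bar{d}(X)\right)^2$$


where $\bar{d}(X)$ is the average distance. 


The loss function minimized by PROXSCAL, normalized raw Stress, is given by: 


$$\sigma^2 = \frac{\eta^2(\hat{D}) + \eta^2(aX) - 2\rho(aX)}{\eta^2(\hat{D})}, \quad \text{with } a = \frac{\rho(X)}{\eta^2(X)}$$


Note that at a local minimum of X, a is equal to one. The other Fit and Stress measures provided 
by PROXSCAL are given by: 


Stress-I: $\sqrt{\dfrac{\eta^2(\hat{D}) + \eta^2(aX) - 2\rho(aX)}{\eta^2(aX)}}$, with $a = \dfrac{\eta^2(\hat{D})}{\rho(X)}$. 

Stress-II: $\sqrt{\dfrac{\eta^2(\hat{D}) + \eta^2(aX) - 2\rho(aX)}{\kappa^2(aX)}}$, with $a = \dfrac{\eta^2(\hat{D})}{\rho(X)}$. 

S-Stress: $\eta^4(\hat{D}) + \eta^4(aX) - 2\rho^2(aX)$, with $a^2 = \dfrac{\rho^2(X)}{\eta^4(X)}$. 

Dispersion Accounted For (DAF): $1 - \sigma^2$. 

Tucker’s coefficient of congruence: $\sqrt{1 - \sigma^2}$. 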
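A few of these measures can be sketched in plain Python from flat lists of weights, transformed proximities, and distances (one entry per pair i<j per source). The function name `stress_measures` is hypothetical, and the square roots in Stress-I and Tucker's coefficient follow the usual definitions of those measures, which is an assumption here. The sketch uses the identities $\eta^2(aX) = a^2\eta^2(X)$ and $\rho(aX) = a\rho(X)$.

```python
from math import sqrt

def stress_measures(w, dhat, d):
    """Normalized raw Stress, Stress-I, DAF and Tucker's coefficient (sketch)."""
    eta2_dhat = sum(wi * p * p for wi, p in zip(w, dhat))
    eta2_x = sum(wi * x * x for wi, x in zip(w, d))
    rho_x = sum(wi * p * x for wi, p, x in zip(w, dhat, d))

    a = rho_x / eta2_x                       # optimal scaling for normalized raw Stress
    nrs = (eta2_dhat + a * a * eta2_x - 2 * a * rho_x) / eta2_dhat

    a1 = eta2_dhat / rho_x                   # scaling used for Stress-I
    num = eta2_dhat + a1 * a1 * eta2_x - 2 * a1 * rho_x
    stress1 = sqrt(max(0.0, num / (a1 * a1 * eta2_x)))

    return {
        "normalized_raw_stress": nrs,
        "stress_1": stress1,
        "daf": 1 - nrs,
        "tucker": sqrt(max(0.0, 1 - nrs)),
    }

fit = stress_measures([1, 1], [1, 1], [1, 2])
```

When the distances are exactly proportional to the transformed proximities, normalized raw Stress is 0 and both DAF and Tucker's coefficient equal 1.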


Decomposition of Normalized Raw Stress 


Each part of normalized raw Stress, as described before, is assigned to objects and sources. The 
sum over objects and the sum over sources are each equal to total normalized raw Stress. 




Transformations on Output 


On output, whenever fixed coordinates or independent variables do not apply, the models are not 
unique. In these cases transformations of the common space and the space weights are in order. 


For the identity model, the common space Z is rotated to principal axes. For the weighted 
Euclidean model, $Z = \sqrt{n}\,Z\left(\mathrm{diag}\,Z^T Z\right)^{-1/2}$ so that $\mathrm{diag}\left(Z^T Z\right) = nI$, and reverse transformations 
are applied to the space weights $A_k$. Further, the sums over sources of the squared space weights 
are put in descending order so as to specify the importance of the dimensions. For the generalized 
Euclidean model, the Cholesky decomposition $Z^T Z = LL^T$ specifies the common space on 
output as $Z = \sqrt{n}\,Z\left(L^T\right)^{-1}$ so that $Z^T Z = nI$. 


QUICK CLUSTER Algorithms 


When the desired number of clusters is known, QUICK CLUSTER groups cases efficiently 
into clusters. 


Notation 


The following notation is used throughout this section unless otherwise stated: 


Table 83-1 
Notation 

Notation          Description 
NC                Number of clusters requested 
$M_i$             Mean of ith cluster 
$x_k$             Vector of kth observation 
$d(x_i, x_j)$     Euclidean distance between vectors $x_i$ and $x_j$ 
$d_{mn}$          $\min_{i \neq j} d(M_i, M_j)$ 
$\varepsilon$     Convergence criteria 


Algorithm 


The first iteration involves three steps. 


Step 1: Select Initial Cluster Centers 


To select the initial cluster centers, a single pass of the data is made. The values of the first 
NC cases with no missing values are assigned as cluster centers, then the remaining cases are 
processed as follows: 


(a) If $\min_i d(x_k, M_i) > d_{mn}$ and $d(x_k, M_m) > d(x_k, M_n)$, then $x_k$ replaces $M_n$. If 
$\min_i d(x_k, M_i) > d_{mn}$ and $d(x_k, M_m) < d(x_k, M_n)$, then $x_k$ replaces $M_m$; that is, if the distance 
between $x_k$ and its closest cluster mean is greater than the distance between the two closest means 
($M_m$ and $M_n$), then $x_k$ replaces either $M_m$ or $M_n$, whichever is closer to $x_k$. 


(b) If $x_k$ does not replace a cluster mean in (a), a second test is made: 

Let $M_q$ be the closest cluster mean to $x_k$. 

Let $M_p$ be the second closest cluster mean to $x_k$. 

If $d(x_k, M_p) > \min_i d(M_q, M_i)$, then $M_q = x_k$; 

that is, if $x_k$ is further from the second closest cluster’s center than the closest cluster’s center is 
from any other cluster’s center, replace the closest cluster’s center with $x_k$. 


At the end of one pass through the data, the initial means of all NC clusters are set. Note that if 
NOINITIAL is specified, the first NC cases with no missing values are the initial cluster means. 
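The single-pass selection of initial centers can be sketched as follows. This is an illustrative reading of tests (a) and (b), not the QUICK CLUSTER source: the function name `select_initial_centers` is hypothetical, cases are assumed to be complete numeric tuples, NC is assumed to be at least 2, and on an exact distance tie in (a) the second center of the closest pair is replaced (the text does not specify tie handling).

```python
from math import dist

def select_initial_centers(cases, nc):
    """One pass over the data to pick nc initial cluster centers (sketch)."""
    centers = [list(x) for x in cases[:nc]]
    for x in cases[nc:]:
        # Closest pair of current centers (M_m, M_n) and their distance d_mn.
        d_mn, m, n = min((dist(centers[i], centers[j]), i, j)
                         for i in range(nc) for j in range(i + 1, nc))
        d_x = [dist(x, c) for c in centers]
        if min(d_x) > d_mn:
            # Test (a): replace whichever of M_m, M_n is closer to x.
            centers[m if d_x[m] < d_x[n] else n] = list(x)
            continue
        # Test (b): q = closest center to x, p = second closest.
        order = sorted(range(nc), key=d_x.__getitem__)
        q, p = order[0], order[1]
        nearest_to_q = min(dist(centers[q], centers[i])
                           for i in range(nc) if i != q)
        if d_x[p] > nearest_to_q:
            centers[q] = list(x)
    return centers

centers = select_initial_centers([(0, 0), (1, 0), (10, 0)], nc=2)
```

In the example, the third case lies farther from both centers than they are from each other, so it replaces the nearer of the two, which spreads the initial centers apart.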




Step 2: Update Initial Cluster Centers 


Starting with the first case, each case in turn is assigned to the nearest cluster, and that cluster 
mean is updated. Note that the initial cluster center is included in this mean. The updated cluster 
means are the classification cluster centers. 


Note that if NOUPDATE is specified, this step is skipped. 


Step 3: Assign Cases to the Nearest Cluster 


The third pass through the data assigns each case to the nearest cluster, where distance from a 
cluster is the Euclidean distance between that case and the (updated) classification centers. Final 
cluster means are then calculated as the average values of clustering variables for cases assigned 
to each cluster. Final cluster means do not contain classification centers. 


When the number of iterations is greater than one, the final cluster means in step 3 are set to the 
classification cluster means in the end of step 2, and QUICK CLUSTER repeats step 3 again. The 
algorithm stops when either the maximum number of iterations is reached or the maximum change 
of cluster centers in two successive iterations is smaller than $\varepsilon$ times the minimum distance among 
the initial cluster centers. 


RANK Algorithms 


RANK produces new variables containing ranks, normal scores, and Savage and related scores 
for numeric variables. 


Notation 


Let $y_1 < y_2 < \dots < y_m$ be m distinct ordered observations for the sample and $C_1, C_2, \dots, C_m$ be 
the corresponding sum of caseweights for each value. Define 


$$CC_i = \sum_{k=1}^{i} C_k = \text{cumulative sum of caseweights up to } y_i$$

$$W = \sum_{k=1}^{m} C_k = \text{total sum of caseweights}$$


Statistics 


The following statistics are available. 


Rank 


A rank is assigned to each case based on four different ways of treating ties or caseweights not 
equal to 1. 

For every i, i = 1,...,m, 

(a) if $C_i \geq 1$ 

Calculation                              Condition 
$R_i = CC_{i-1} + 1$                     if TIES = LOW 
$R_i = CC_i$                             if TIES = HIGH 
$R_i = CC_{i-1} + (C_i + 1)/2$           if TIES = MEAN 
$R_i = i$                                if TIES = CONDENSE 

(b) if $C_i < 1$ 

Calculation                              Condition 
$R_i = CC_{i-1}$                         if TIES = LOW 
$R_i = CC_i$                             if TIES = HIGH 
$R_i = CC_{i-1} + C_i/2$                 if TIES = MEAN 
$R_i = i$                                if TIES = CONDENSE 

Note: $CC_0 = 0$ 
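The rank rules can be sketched in Python as below. The function name `weighted_ranks` is hypothetical, and the sketch assumes that the first table applies when the summed caseweight of a value is at least 1 and the second when it is below 1.

```python
def weighted_ranks(values, weights=None, ties="MEAN"):
    """Assign ranks under the LOW, HIGH, MEAN or CONDENSE tie rules (sketch)."""
    if weights is None:
        weights = [1.0] * len(values)
    # C_i: summed caseweight of each distinct value y_i.
    totals = {}
    for v, w in zip(values, weights):
        totals[v] = totals.get(v, 0.0) + w
    ranks = {}
    cc_prev = 0.0                               # CC_0 = 0
    for i, y in enumerate(sorted(totals), start=1):
        c = totals[y]
        cc = cc_prev + c                        # CC_i
        if ties == "LOW":
            r = cc_prev + 1 if c >= 1 else cc_prev
        elif ties == "HIGH":
            r = cc
        elif ties == "MEAN":
            r = cc_prev + (c + 1) / 2 if c >= 1 else cc_prev + c / 2
        elif ties == "CONDENSE":
            r = float(i)
        else:
            raise ValueError(f"unknown TIES option: {ties}")
        ranks[y] = r
        cc_prev = cc
    return [ranks[v] for v in values]

mean_ranks = weighted_ranks([10, 20, 20, 30], ties="MEAN")
```

With unit weights, the MEAN rule gives the two tied cases the average of the ranks they jointly occupy.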


RFRACTION 


Fractional rank: 


$$RF_i = R_i/W, \quad i = 1, \dots, m$$


PERCENT 


Fractional rank as a percentage: 


$$P_i = \frac{R_i}{W} \times 100, \quad i = 1, \dots, m$$


PROPORTION Estimate for Cumulative Proportion 


The proportion is calculated for each case based on four different methods of estimating fractional 
rank: 

Calculation                              Method 
$F_i = (R_i - 3/8)/(W + 1/4)$            (BLOM) 
$F_i = (R_i - 1/2)/W$                    (RANKIT) 
$F_i = (R_i - 1/3)/(W + 1/3)$            (TUKEY) 
$F_i = R_i/(W + 1)$                      (Van der Waerden) 


Note: $F_i$ will be set to SYSMIS if the calculated value of $F_i$ by the formula is negative. 


NORMAL (a) 


Normal scores are the Z-scores from the standard normal distribution that correspond to the 
estimated cumulative proportion F. The normal score is defined by 


$$a_i = \Phi^{-1}(F_i), \quad i = 1, \dots, m$$


where $\Phi^{-1}$ is the inverse cumulative standard normal distribution (PROBIT). 
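The proportion estimates and the normal scores derived from them can be sketched with the Python standard library. The function names are hypothetical, `None` stands in for SYSMIS, and the constants assume the standard Blom (3/8, 1/4), Rankit (1/2), and Tukey (1/3, 1/3) definitions.

```python
from statistics import NormalDist

def proportion_estimate(r, w, method="BLOM"):
    """Estimated cumulative proportion F for rank r and total weight w (sketch)."""
    if method == "BLOM":
        f = (r - 3 / 8) / (w + 1 / 4)
    elif method == "RANKIT":
        f = (r - 1 / 2) / w
    elif method == "TUKEY":
        f = (r - 1 / 3) / (w + 1 / 3)
    elif method == "VW":
        f = r / (w + 1)
    else:
        raise ValueError(f"unknown method: {method}")
    return None if f < 0 else f     # None plays the role of SYSMIS here

def normal_score(f):
    """Inverse standard normal CDF; requires 0 < f < 1."""
    return NormalDist().inv_cdf(f)

score = normal_score(proportion_estimate(2, 3, "RANKIT"))
```

For three equally weighted cases, the middle rank maps to a proportion of 0.5 under RANKIT, so its normal score is 0.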


NTILES (K) 


Assign group membership for the requested number of groups. If K groups are requested, the 
ntile ($N_i$) for case i is defined by 


$$N_i = \left\lfloor \frac{R_i K}{W + 1} \right\rfloor + 1$$


where $\left\lfloor R_i K/(W+1) \right\rfloor$ is the greatest integer that is less than or equal to $R_i K/(W + 1)$. 


SAVAGE (S) 


Savage scores based on exponential distribution. The Savage score is calculated by 


$$S_i = \begin{cases} \left\{(1 - g_{i_1})\,l_{i_1+1} + g_{i_2}\,l_{i_2+1} + \sum_{j=i_1+2}^{i_2} l_j\right\}/C_i - 1 & i_1 + 2 \leq i_2 \\ \left\{(1 - g_{i_1})\,l_{i_1+1} + g_{i_2}\,l_{i_2+1}\right\}/C_i - 1 & i_1 + 1 = i_2 \\ l_{i_1+1} - 1 & i_1 = i_2 \end{cases}$$

where 

$$i_1 = [CC_{i-1}], \quad i_2 = [CC_i], \quad W^{*} = \begin{cases} W & \text{if } W \text{ is an integer} \\ [W] + 1 & \text{if } W \text{ is not an integer} \end{cases}$$

$$g_{i_1} = CC_{i-1} - i_1, \quad g_{i_2} = CC_i - i_2$$


and $l_1, \dots, l_{W^{*}}$ are defined as the expected values of the order statistics from an exponential 
distribution; that is 


$$l_j = \sum_{k=1}^{j} \frac{1}{W^{*} - k + 1}$$


RATIO STATISTICS Algorithms 


This procedure provides a variety of descriptive statistics for the ratio of two variables. 


Notation 


The following notation is used throughout this section unless otherwise stated: 


Table 85-1 
Notation 

Notation    Description 
n           Number of observations 
$A_i$       Numerator of the ith ratio (i=1,...,n). This is usually the appraisal roll value. 
$S_i$       Denominator of the ith ratio (i=1,...,n). This is usually the sale price. 
$R_i$       The ith ratio (i=1,...,n). Often called the appraisal ratio. 
$f_i$       Case weight associated with the ith ratio (i=1,...,n). 


Data 


This procedure requires for i = 1, ..., n that: 

■ $A_i > 0$, 

■ $S_i > 0$, 

■ $f_i > 0$, and 

■ $f_i$ is a whole number. If the Weight variable contains fractional values, then only the integral 
parts are used. 

A case is considered valid if it satisfies all four requirements above. This procedure will use only 
valid cases in computing the requested statistics. 


Ratio Statistics 


The following statistics are available. 


Ratio 


$$R_i = \frac{A_i}{S_i}, \quad i = 1, \dots, n$$


Minimum 


The smallest ratio, denoted by $R_{\min}$. 


Maximum 


The largest ratio, denoted by $R_{\max}$. 


Range 


The difference between the largest and the smallest ratios. It is equal to $R_{\max} - R_{\min}$. 


Median 


The middle number of the sorted ratios if n is odd. The mean (average) of the two middle ratios if 
n is even. The median is denoted as $\tilde{R}$. 


Average Absolute Deviation (AAD) 


$$AAD = \sum_{i=1}^{n} f_i \left|R_i - \tilde{R}\right| \Big/ \sum_{i=1}^{n} f_i$$


Coefficient of Dispersion (COD) 


$$COD = 100\% \times \frac{AAD}{\tilde{R}}$$


Coefficient of Concentration (COC) 


Given a percentage 100% × g, the coefficient of concentration is the percentage of ratios falling 
within the interval $[(1 - g)\tilde{R},\ (1 + g)\tilde{R}]$. The higher this coefficient, the better the uniformity. 


Mean 


$$\overline{A/S} = \bar{R} = \sum_{i=1}^{n} f_i R_i \Big/ \sum_{i=1}^{n} f_i$$


Standard Deviation (SD) 


$$s = \sqrt{\frac{1}{F - 1}\sum_{i=1}^{n} f_i \left(R_i - \bar{R}\right)^2},$$


where $F = \sum_{i=1}^{n} f_i$. 


Coefficient of Variation (COV) 


$$COV = 100\% \times \frac{s}{\bar{R}}$$


Weighted Mean 
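$$\bar{A}/\bar{S} = \frac{\sum_{i=1}^{n} f_i A_i}{\sum_{i=1}^{n} f_i S_i} = \frac{\sum_{i=1}^{n} f_i S_i R_i}{\sum_{i=1}^{n} f_i S_i}$$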

This is the weighted mean of the ratios weighted by the sales prices in addition to the usual 
case weights. 


Price Related Differential (a.k.a. Index of Regressivity) 


$$PRD = \frac{\overline{A/S}}{\bar{A}/\bar{S}}$$


This is the quotient obtained by dividing the Mean by the Weighted Mean. 
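The descriptive ratio statistics can be tied together in one short Python sketch. The function name `ratio_statistics` is hypothetical, and treating integer case weights as replication counts when locating the median is an assumption of this sketch, since the text defines the median only in terms of n.

```python
from math import sqrt

def ratio_statistics(appraisals, sales, weights=None):
    """Median, AAD, COD, mean, SD, COV, weighted mean and PRD (sketch)."""
    n = len(appraisals)
    f = [1] * n if weights is None else [int(w) for w in weights]
    ratios = [a / s for a, s in zip(appraisals, sales)]
    F = sum(f)
    # Median: replicate each ratio by its integer case weight (assumption).
    expanded = sorted(r for r, fi in zip(ratios, f) for _ in range(fi))
    mid = F // 2
    median = expanded[mid] if F % 2 else (expanded[mid - 1] + expanded[mid]) / 2
    aad = sum(fi * abs(r - median) for r, fi in zip(ratios, f)) / F
    mean = sum(fi * r for r, fi in zip(ratios, f)) / F
    sd = sqrt(sum(fi * (r - mean) ** 2 for r, fi in zip(ratios, f)) / (F - 1))
    wmean = (sum(fi * a for a, fi in zip(appraisals, f))
             / sum(fi * s for s, fi in zip(sales, f)))
    return {
        "median": median, "aad": aad, "cod": 100 * aad / median,
        "mean": mean, "sd": sd, "cov": 100 * sd / mean,
        "weighted_mean": wmean, "prd": mean / wmean,
    }

stats = ratio_statistics([50, 100, 300], [100, 100, 200])
```

In this example the most expensive property has the highest ratio, so the weighted mean exceeds the mean and the PRD falls below 1.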


Property appraisals sometimes result in unequal tax burden between high-value and low-value 
properties in the same property group. Appraisals are considered regressive if high-value 
properties are under-appraised relative to low-value properties. On the contrary, appraisals are 
considered progressive if high-value properties are relatively over-appraised. The price related 
differential is a measure of assessment regressivity or progressivity. Hence the price 
related differential is also known as the index of regressivity. 


Recall that the [unweighted] mean weights the ratios equally, whereas the weighted mean weights 
the ratios by their sale prices. If the PRD is greater than 1, high-value properties are 
under-appraised, thus pulling the weighted mean below the mean. On the 
other hand, if the PRD is less than 1, high-value properties are relatively over-appraised, 
pulling the weighted mean above the mean. 


Confidence Interval for the Median 


The confidence interval can be computed under the assumption that the ratios follow a normal 
distribution or nonparametrically. 


Distribution free (nonparametric) 


Given the confidence level 100%×(1 − α), the confidence interval for the median is an 
$(R_{(r)}, R_{(n-r+1)})$ interval such that 


$$1 - \alpha = 1 - 2 I_{0.5}(n - r + 1, r)$$


where $R_{(k)}$ is the 100%×k/n quantile, and $I_{0.5}(n - r + 1, r)$ is the incomplete Beta function. 


An equivalent formula is 


$$\frac{\alpha}{2} = I_{0.5}(n - r + 1, r) = \sum_{k=0}^{r-1}\binom{n}{k} 0.5^{n}$$


Since the rightmost term is the cumulative Binomial distribution and it is discrete, r is solved as 
the largest value such that 


$$\sum_{k=0}^{r-1}\binom{n}{k} 0.5^{n} \leq \frac{\alpha}{2}$$


Thus the confidence interval has coverage probability of at least 1 − α. 


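A sketch of how r might be found in practice: accumulate the Binomial(n, 0.5) lower tail and keep the largest r whose tail probability P(X ≤ r − 1) does not exceed α/2. The function name `median_ci_order` is hypothetical, and returning `None` when no valid r exists is a choice made for this sketch.

```python
from math import comb

def median_ci_order(n, alpha=0.05):
    """Return (r, n - r + 1), the order-statistic indices of the interval, or None."""
    cum = 0.0
    r = 0
    for k in range(n // 2 + 1):
        cum += comb(n, k) * 0.5 ** n      # cum = P(X <= k) for X ~ Binomial(n, 0.5)
        if cum > alpha / 2:
            break
        r = k + 1                         # P(X <= r - 1) <= alpha / 2 still holds
    return (r, n - r + 1) if r >= 1 else None

bounds = median_ci_order(10, alpha=0.05)
```

For n = 10 and α = 0.05 the tail probabilities are about 0.001, 0.011, and 0.055, so r = 2 and the interval runs from the 2nd to the 9th ordered ratio. For very small n no order statistics achieve the requested coverage.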
Normal distribution 


Assuming the ratios follow a normal distribution, a two-sided 100%×(1 − α) confidence interval 
for a median of a normal distribution is 


$$\left(\bar{R} + g_{(\alpha/2;\,0.5,\,d)} \times s,\ \ \bar{R} + g_{(1-\alpha/2;\,0.5,\,d)} \times s\right)$$


where $g_{(\gamma;\,p,\,d)}$ are values defined in Table 1 of Odeh and Owen (1980). 


The value $g_{(\gamma;\,p,\,d)}$ is, in fact, the solution to the following equation: 

$$\Pr\left(T_d \leq g\sqrt{n} \mid \delta = K_p\sqrt{n}\right) = \gamma$$


where $T_d$ follows a noncentral Student t-distribution, d is the degrees of freedom associated with 
the standard deviation s, $\delta$ is the noncentrality parameter, $\gamma$ is the probability, n is the sample size, and 
$K_p$ is the upper p percentile point of a standard normal distribution. 


Confidence Interval for the Mean 


The normal distribution is used to approximate the distribution of the ratios. The 
100%×(1 − α) confidence interval for the mean is: 


$$\bar{R} \pm t_{\alpha/2;\,F-1} \times s/\sqrt{F}$$


where $t_{\alpha/2;\,F-1}$ is the upper α/2 percentage point of the t distribution with F − 1 degrees of 
freedom, and where $F = \sum_{i=1}^{n} f_i$. 




Confidence Interval for the Weighted Mean 


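Using the Delta method, variance of the weighted mean is approximated as 


$$\mathrm{var}\left(\bar{A}/\bar{S}\right) \approx \frac{\mathrm{var}(\bar{A})}{\bar{S}^2} - \frac{2\bar{A}\,\mathrm{cov}(\bar{A},\bar{S})}{\bar{S}^3} + \frac{\bar{A}^2\,\mathrm{var}(\bar{S})}{\bar{S}^4}$$

where 

$$\mathrm{var}(\bar{A}) = \frac{1}{F-1}\sum_{i=1}^{n} f_i\left(A_i - \bar{A}\right)^2 \times \sum_{i=1}^{n} f_i^2/F^2,$$

$$\mathrm{var}(\bar{S}) = \frac{1}{F-1}\sum_{i=1}^{n} f_i\left(S_i - \bar{S}\right)^2 \times \sum_{i=1}^{n} f_i^2/F^2, \text{ and}$$

$$\mathrm{cov}(\bar{A},\bar{S}) = \frac{1}{F-1}\sum_{i=1}^{n} f_i\left(A_i - \bar{A}\right)\left(S_i - \bar{S}\right) \times \sum_{i=1}^{n} f_i^2/F^2.$$


RBF Algorithms 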


A radial basis function (RBF) network is a feed-forward, supervised learning network with only 
one hidden layer, called the radial basis function layer. The RBF network is a function of one 
or more predictors (also called inputs or independent variables) that minimizes the prediction 
error of one or more target variables (also called outputs). Predictors and targets can be a mix 
of categorical and scale variables. 


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


$X^{(m)} = \left(x_1^{(m)}, \dots, x_P^{(m)}\right)$    Input vector, pattern m, m=1,...,M. 

$Y^{(m)} = \left(y_1^{(m)}, \dots, y_R^{(m)}\right)$    Target vector, pattern m. 

I                   Number of layers, discounting the input layer. For an RBF network, I=2. 

$J_i$               Number of units in layer i. $J_0 = P$, $J_I = R$, discounting the bias unit. $J_1$ 
                    is the number of RBF units. 

$\phi_j(X^{(m)})$   jth RBF unit for input $X^{(m)}$, j=1,...,$J_1$. 

$\mu_j$             Center of $\phi_j$; it is P-dimensional. 

$\sigma_j$          Width of $\phi_j$; it is P-dimensional. 

h                   The RBF overlapping factor. 

$a^{m}_{i:j}$       Unit j of layer i, pattern m, j = 0,...,$J_i$; i = 0,...,I. 

$w_{rj}$            Weight connecting rth output unit and jth hidden unit of RBF layer. 


Architecture 


There are three layers in the RBF network: 

Input layer: $J_0 = P$ units, $a_{0:1}, \dots, a_{0:J_0}$; with $a_{0:j} = x_j$. 

RBF layer: $J_1$ units, $a_{1:1}, \dots, a_{1:J_1}$; with $a_{1:j} = \phi_j(X)$ and $\phi_j(X)$ described below. 

Output layer: $J_2 = R$ units, $a_{I:1}, \dots, a_{I:J_2}$; with $a_{I:r} = w_{r0} + \sum_{j=1}^{J_1} w_{rj}\phi_j(X)$. 

There are many types of radial basis functions; there are two distinct types of Gaussian RBF 
architectures that we support: 


Ordinary RBF (ORBF): This type uses the exp activation function, so the activation of the RBF 
unit is a Gaussian “bump” as a function of the inputs. In ORBF, the Gaussian basis function takes 
form 


$$\phi_j(X) = \exp\left(-\sum_{p=1}^{P}\frac{1}{2\sigma_{jp}^2}\left(x_p - \mu_{jp}\right)^2\right)$$


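Normalized RBF (NRBF): This type uses the softmax activation function, so the activations of all 
the RBF units are normalized to sum to one. In NRBF networks, the basis function takes form 


$$\phi_j(X) = \frac{\exp\left(-\sum_{p=1}^{P}\frac{1}{2\sigma_{jp}^2}\left(x_p - \mu_{jp}\right)^2\right)}{\sum_{j'=1}^{J_1}\exp\left(-\sum_{p=1}^{P}\frac{1}{2\sigma_{j'p}^2}\left(x_p - \mu_{j'p}\right)^2\right)}$$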

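The two activation types can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `orbf_activations` and `nrbf_activations` and the list-of-lists layout for centers and widths are hypothetical, not part of the procedure.

```python
import math

def orbf_activations(x, centers, widths):
    """Gaussian activation of every RBF unit for one input vector x.

    centers[j][p] and widths[j][p] play the roles of mu_jp and sigma_jp
    (hypothetical layout chosen for this sketch).
    """
    acts = []
    for mu_j, sigma_j in zip(centers, widths):
        s = sum((xp - m) ** 2 / (2.0 * sg ** 2)
                for xp, m, sg in zip(x, mu_j, sigma_j))
        acts.append(math.exp(-s))
    return acts

def nrbf_activations(x, centers, widths):
    """ORBF activations rescaled so that they sum to one.

    Assumes at least one unit has a non-zero activation; otherwise the
    division below fails.
    """
    raw = orbf_activations(x, centers, widths)
    total = sum(raw)
    return [r / total for r in raw]
```

With ORBF a record far from every center yields activations near zero for all units, whereas NRBF always distributes a total activation of one across the units.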

Error Function


Sum-of-squares error is used:

$E_T(w) = \sum_{m=1}^{M} E_m(w)$, where $E_m(w) = \frac{1}{2}\sum_{r=1}^{R}\left(y_r^{(m)} - a^m_{I:r}\right)^2$

The sum-of-squares error function with identity activation function for the output layer can be used for both scale and categorical targets. For scale targets, $a^m_{I:r}$ approximates the conditional expectation of the target value, $E(y_r \mid X^{(m)})$. For categorical targets, $a^m_{I:r}$ approximates the posterior probability of the class coded by output unit r: $P(y_r = 1 \mid X^{(m)})$.


Note: although $\sum_r a^m_{I:r} = 1$ (the sum is over all classes of the same categorical target variable), $a^m_{I:r}$ may not lie in the range [0, 1].


Training 
The network is trained in two stages: 


1. Determine the basis functions by clustering methods. The center and width of each basis
function are computed.


2. Determine the weights given the basis functions. For the given basis functions, 
compute the ordinary least-squares regression estimates of the weights. 


The simplicity of these computations allows the RBF network to be trained very quickly. 




Determining Basis Functions 


The two-step clustering algorithm is used to find the RBF centers and widths. For each cluster, the mean and standard deviation for each scale variable and the proportion of each category for each categorical variable are derived. Using the results from clustering, the center of the jth RBF is set as:

$\mu_{jp} = \begin{cases} \bar{x}_{jp} & \text{if pth variable is scale} \\ \pi_{jp} & \text{if pth variable is a dummy variable of a categorical variable} \end{cases}$

where $\bar{x}_{jp}$ is the jth cluster mean of the pth input variable if it is scale, and $\pi_{jp}$ is the proportion of the category of a categorical variable that the pth input variable corresponds to. The width of the jth RBF is set as

$\sigma_{jp} = \begin{cases} h^{1/2}\, s_{jp} & \text{if pth variable is scale} \\ h^{1/2} \sqrt{\pi_{jp}(1 - \pi_{jp})} & \text{if pth variable is a dummy variable of a categorical variable} \end{cases}$

where $s_{jp}$ is the jth cluster standard deviation of the pth variable and $h > 0$ is the RBF overlapping factor that controls the amount of overlap among the RBFs. Since some $\sigma_{jp}$ may be zero, we use spherical shaped Gaussian bumps; that is, a common width

$\sigma_j = \left(\dfrac{1}{P}\sum_{p=1}^{P} \sigma_{jp}^2\right)^{1/2}$

for all predictors. In the case that $\sigma_j$ is zero for some j, set it to be $\min_s\{\sigma_s : \sigma_s \ne 0\}$. If all $\sigma_j$ are zero, set all of them to be $\sqrt{h}$.


When there are a large number of predictors, $\sum_{p=1}^{P} (x_p - \mu_{jp})^2$ could easily be very large, and hence $\exp\left(-\frac{1}{2\sigma_j^2}\sum_{p=1}^{P}(x_p - \mu_{jp})^2\right)$ is practically zero for every record and every RBF unit if $\sigma_j$ is relatively small. This is especially bad for ORBF because there would be only a constant term in the model when this happens. To avoid this, $\sigma_j$ is increased by setting the default overlapping factor h proportional to the number of inputs: $h = 1 + 0.1P$.


For more information, see the topic “TWOSTEP CLUSTER Algorithms”.
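The width rules above can be sketched as follows. This is a minimal illustration, assuming the per-variable cluster standard deviations are already available (for a dummy variable, the value $\sqrt{\pi_{jp}(1-\pi_{jp})}$); the names `default_overlap` and `common_widths` are hypothetical.

```python
import math

def default_overlap(num_inputs):
    """Default overlapping factor h = 1 + 0.1 * P."""
    return 1.0 + 0.1 * num_inputs

def common_widths(cluster_sds, h):
    """Common (spherical) width per RBF unit.

    cluster_sds[j][p] is the spread of input p in cluster j
    (hypothetical layout chosen for this sketch).
    """
    widths = []
    for sds in cluster_sds:
        squared = [h * s * s for s in sds]      # sigma_jp^2 = h * s_jp^2
        widths.append(math.sqrt(sum(squared) / len(squared)))
    nonzero = [w for w in widths if w > 0.0]
    # Zero widths are replaced by the smallest non-zero width,
    # or by sqrt(h) when every width is zero.
    fallback = min(nonzero) if nonzero else math.sqrt(h)
    return [w if w > 0.0 else fallback for w in widths]
```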


Automatic Selection of Number of Basis Functions 


The algorithm tries a reasonable range of numbers of hidden units and picks the “best”. By default, the reasonable range $[K_1, K_2]$ is determined by first using the two-step clustering method to automatically find the number of clusters, $K$. Then set $K_1 = \min(K, R)$ for ORBF and $K_1 = \max\{2, \min(K, R)\}$ for NRBF, and $K_2 = \max(10, 2K, R)$.
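As a small worked illustration of the default range, the rule can be written as a function. The name `hidden_unit_range` and its arguments are hypothetical; `k_clusters` stands for K and `n_outputs` for R.

```python
def hidden_unit_range(k_clusters, n_outputs, normalized):
    """Return (K1, K2), the default range of hidden-unit counts to try.

    normalized=True selects the NRBF rule, False the ORBF rule.
    """
    lower = min(k_clusters, n_outputs)
    if normalized:
        lower = max(2, lower)
    upper = max(10, 2 * k_clusters, n_outputs)
    return lower, upper
```

For example, with K = 3 clusters and R = 1 output unit, the ORBF range is (1, 10) and the NRBF range is (2, 10).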


If a test data set is specified, then the “best” model is the one with the smallest error in the test data. If there is no test data, the BIC (Bayesian information criterion) is used to select the “best” model. The BIC is defined as

$BIC = M R \ln(MSE) + k \ln(M)$

where $MSE = \frac{1}{MR}\sum_{m=1}^{M}\sum_{r=1}^{R}\left(y_r^{(m)} - a^m_{I:r}\right)^2$ is the mean squared error and $k = (P + 1 + R)J_1$ for NRBF and $(P + 1 + R)J_1 + R$ for ORBF is the number of parameters in the model.
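The BIC used for model selection can be sketched as below. This is an illustration only; the function name `rbf_bic` and the row-per-pattern list layout are assumptions made for the example.

```python
import math

def rbf_bic(targets, outputs, n_inputs, n_hidden, normalized):
    """BIC = M*R*ln(MSE) + k*ln(M).

    targets[m][r] and outputs[m][r] hold the observed and predicted
    values for pattern m and output unit r (hypothetical layout).
    """
    m = len(targets)
    r = len(targets[0])
    sse = sum((y - a) ** 2
              for y_row, a_row in zip(targets, outputs)
              for y, a in zip(y_row, a_row))
    mse = sse / (m * r)
    k = (n_inputs + 1 + r) * n_hidden      # parameter count for NRBF
    if not normalized:
        k += r                             # ORBF adds R further parameters
    return m * r * math.log(mse) + k * math.log(m)
```

Among the candidate numbers of hidden units, the model with the smallest BIC would be retained when no test data are available.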

Output Statistics 


The following output statistics are available. Note that, for scale variables, output statistics are 
reported in terms of the rescaled values of the variables. 


Sum-of-Squares Error 


As described in “Error Function”. The cross entropy error is displayed if the output layer 
activation function is softmax, otherwise the sum-of-squares error is shown. 


Relative Error 


For each scale target r:

$\dfrac{\sum_{m=1}^{M}\left(y_r^{(m)} - \hat{y}_r^{(m)}\right)^2}{\sum_{m=1}^{M}\left(y_r^{(m)} - \bar{y}_r\right)^2}$

For each categorical target r, report $p_r$, the percent of incorrect predictions.
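The relative error for one scale target is the residual sum of squares divided by the sum of squares around the target mean. A minimal sketch, with a hypothetical function name:

```python
def relative_error(observed, predicted):
    """Sum of squared errors relative to the sum of squares about the mean.

    Assumes the observed values are not all identical, so the
    denominator is non-zero.
    """
    mean = sum(observed) / len(observed)
    numerator = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    denominator = sum((y - mean) ** 2 for y in observed)
    return numerator / denominator
```

A value of 0 means perfect prediction; a value of 1 means the model does no better than always predicting the mean.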


Average Overall Relative Error 


If there is at least one scale target:

$\dfrac{\sum_{m=1}^{M}\sum_{r=1}^{R}\left(y_r^{(m)} - \hat{y}_r^{(m)}\right)^2}{\sum_{m=1}^{M}\sum_{r=1}^{R}\left(y_r^{(m)} - \bar{y}_r\right)^2}$

where $\bar{y}_r$ is the mean of $y_r^{(m)}$ over patterns.

If all targets are categorical, report the average percent of incorrect predictions:

$\dfrac{1}{C}\sum_{r} p_r$

where C is the number of categorical variables.


Sensitivity Analysis 


For each predictor p and each input pattern m, compute:

$d_{pm} = \max_{x_{p1},\, x_{p2} \in S_p} \left\| \hat{Y}_{p1}^{(m)} - \hat{Y}_{p2}^{(m)} \right\|$

where $\hat{Y}_{pi}^{(m)}$ is the predicted output vector (standardized if standardization of output variable is used in training) using $(x_1^{(m)}, \ldots, x_{p-1}^{(m)}, x_{pi}, x_{p+1}^{(m)}, \ldots, x_P^{(m)})$ as its input, and $S_p$ is the set of values substituted for the pth predictor. For categorical predictors, $S_p$ is {(1,0,...,0),(0,1,0,...,0),...,(0,0,...,1)}.

Then compute:

$d_p = \dfrac{1}{M}\sum_{m=1}^{M} d_{pm}$



Sp = ae rae x2) Pacem gaa for scale predictors and 


and normalize the $d_p$ values to sum to 1, and report these normalized values as the sensitivity values for the predictors. This is the average maximum amount we can expect the output to change based on changes in the pth predictor. The greater the sensitivity, the more we expect the output to change when the predictor changes.


REGRESSION Algorithms


This procedure performs multiple linear regression with five methods for entry and removal 
of variables. It also provides extensive analysis of residual and influential cases. Caseweight 
(CASEWEIGHT) and regression weight (REGWGT) can be specified in the model fitting. 


Notation 


The following notation is used throughout this section unless otherwise stated: 


Table 87-1
Notation

Notation    Description
$y_i$    Dependent variable for case i with variance $\sigma^2 / g_i$
$c_i$    Caseweight for case i; $c_i = 1$ if CASEWEIGHT is not specified
$g_i$    Regression weight for case i; $g_i = 1$ if REGWGT is not specified
$l$    Number of distinct cases
$w_i$    $c_i g_i$
$W$    $\sum_{i=1}^{l} w_i$
$p$    Number of independent variables
$C$    Sum of caseweights: $\sum_{i=1}^{l} c_i$
$x_{ki}$    The kth independent variable for case i
$\bar{X}_k$    Sample mean for the kth independent variable: $\bar{X}_k = \left(\sum_{i=1}^{l} w_i x_{ki}\right) / W$
$\bar{Y}$    Sample mean for the dependent variable: $\bar{Y} = \left(\sum_{i=1}^{l} w_i y_i\right) / W$
$h_i$    Leverage for case i
$\tilde{h}_i$    $g_i / W + h_i$
$S_{kj}$    Sample covariance for $X_k$ and $X_j$
$S_{yy}$    Sample variance for Y
$S_{ky}$    Sample covariance for $X_k$ and Y
$p^*$    Number of coefficients in the model. $p^* = p$ if the intercept is not included; otherwise $p^* = p + 1$
$R$    The sample correlation matrix for $X_1, \ldots, X_p$ and Y


Descriptive Statistics


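$R = \begin{pmatrix} r_{11} & \cdots & r_{1p} & r_{1y} \\ r_{21} & \cdots & r_{2p} & r_{2y} \\ \vdots & & \vdots & \vdots \\ r_{y1} & \cdots & r_{yp} & r_{yy} \end{pmatrix}$

where

$r_{kj} = \dfrac{S_{kj}}{\sqrt{S_{kk} S_{jj}}}$

and

$r_{yk} = r_{ky} = \dfrac{S_{ky}}{\sqrt{S_{kk} S_{yy}}}$
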
The sample mean $\bar{X}_i$ and covariance $S_{ij}$ are computed by a provisional means algorithm. Define

$W_k = \sum_{i=1}^{k} w_i$ = cumulative weight up to case k.

Then

$\bar{X}_{i(k)} = \bar{X}_{i(k-1)} + \left(x_{ik} - \bar{X}_{i(k-1)}\right)\dfrac{w_k}{W_k}$

where

$\bar{X}_{i(1)} = x_{i1}$

If the intercept is included,

$C_{ij(k)} = C_{ij(k-1)} + \left(x_{ik} - \bar{X}_{i(k-1)}\right)\left(x_{jk} - \bar{X}_{j(k-1)}\right)\left(w_k - \dfrac{w_k^2}{W_k}\right)$

where

$C_{ij(1)} = 0$

Otherwise,

$C_{ij(k)} = C_{ij(k-1)} + w_k x_{ik} x_{jk}$

where

$C_{ij(1)} = w_1 x_{i1} x_{j1}$

The sample covariance $S_{ij}$ is computed as the final $C_{ij}$ divided by $C - 1$.
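The provisional means update (intercept included) can be sketched as a single pass over the cases. This is an illustrative sketch for two variables only; the function name `provisional_moments` is hypothetical, and all weights are assumed positive.

```python
def provisional_moments(xs, ys, weights):
    """One-pass weighted means and corrected cross-product.

    Returns (mean_x, mean_y, c_xy), where c_xy corresponds to the
    final C_ij; dividing it by (sum of caseweights - 1) gives the
    sample covariance.
    """
    mean_x = mean_y = 0.0
    c_xy = 0.0
    cum_w = 0.0
    for x, y, w in zip(xs, ys, weights):
        cum_w += w                        # W_k, cumulative weight
        dx = x - mean_x                   # deviation from previous mean
        dy = y - mean_y
        c_xy += dx * dy * (w - w * w / cum_w)
        mean_x += dx * w / cum_w
        mean_y += dy * w / cum_w
    return mean_x, mean_y, c_xy
```

Because each case is compared with the running mean rather than with a raw sum of squares, this update avoids the cancellation that the "sum of squares minus squared sum" formula suffers on large or badly scaled data.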


Sweep Operations (Dempster, 1969) 


For a regression model of the form

$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_p X_{pi} + e_i$

sweep operations are used to compute the least squares estimates b of $\beta$ and the associated regression statistics. The sweeping starts with the correlation matrix R. Let $\tilde{R}$ be the new matrix produced by sweeping on the kth row and column of R. The elements of $\tilde{R}$ are

$\tilde{r}_{kk} = \dfrac{1}{r_{kk}}$

$\tilde{r}_{ik} = \dfrac{r_{ik}}{r_{kk}}, \quad i \ne k$

$\tilde{r}_{kj} = -\dfrac{r_{kj}}{r_{kk}}, \quad j \ne k$

$\tilde{r}_{ij} = \dfrac{r_{ij} r_{kk} - r_{ik} r_{kj}}{r_{kk}}, \quad i \ne k,\ j \ne k$

If the above sweep operations are repeatedly applied to each row of $R_{11}$ in

$R = \begin{pmatrix} R_{11} & R_{12} \\ R_{21} & R_{22} \end{pmatrix}$

where $R_{11}$ contains independent variables in the equation at the current step, the result is

$\tilde{R} = \begin{pmatrix} R_{11}^{-1} & -R_{11}^{-1} R_{12} \\ R_{21} R_{11}^{-1} & R_{22} - R_{21} R_{11}^{-1} R_{12} \end{pmatrix}$

The last row of

$R_{21} R_{11}^{-1}$

contains the standardized coefficients (also called BETA), and

$R_{22} - R_{21} R_{11}^{-1} R_{12}$

can be used to obtain the partial correlations for the variables not in the equation, controlling for the variables already in the equation. Note that this routine is its own inverse; that is, exactly the same operations are performed to remove a variable as to enter a variable.
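A single sweep step can be sketched in plain Python. The sketch assumes the sign convention in which the pivot becomes 1/r_kk, the pivot column is divided by the pivot, and the pivot row is divided by the pivot and negated; under that convention the routine is its own inverse. The function name `sweep` and the example matrix are illustrative only.

```python
def sweep(matrix, k):
    """Return a new matrix swept on row and column k."""
    n = len(matrix)
    pivot = matrix[k][k]
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == k and j == k:
                out[i][j] = 1.0 / pivot
            elif j == k:                      # pivot column
                out[i][j] = matrix[i][k] / pivot
            elif i == k:                      # pivot row, sign flipped
                out[i][j] = -matrix[k][j] / pivot
            else:
                out[i][j] = matrix[i][j] - matrix[i][k] * matrix[k][j] / pivot
    return out

# Correlation matrix for (X1, X2, Y); sweeping on X1 and X2 "enters" both.
R = [[1.0, 0.5, 0.6],
     [0.5, 1.0, 0.7],
     [0.6, 0.7, 1.0]]
swept = sweep(sweep(R, 0), 1)
betas = swept[2][:2]          # last row: standardized coefficients
residual = swept[2][2]        # r_yy after sweeping, i.e. 1 - R^2
```

Sweeping `swept` again on the same two indices returns the original correlation matrix, which is how a variable is removed.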


Note: When the stepwise or forward entry method is used, the variable order in the swept correlation matrix described above might differ from the variable order in the Swept Correlation Matrix table displayed in the output.


Variable Selection Criteria


Let $r_{ij}$ be the element in the current swept matrix associated with $X_i$ and $X_j$. Variables are entered or removed one at a time. $X_k$ is eligible for entry if it is an independent variable not currently in the model with

$r_{kk} \ge t$ (tolerance with a default of 0.0001)

and also, for each variable $X_j$ that is currently in the model,

$\left(r_{jj} - \dfrac{r_{jk} r_{kj}}{r_{kk}}\right) t \le 1$

The above condition is imposed so that entry of the variable does not reduce the tolerance of variables already in the model to unacceptable levels.

The F-to-enter value for $X_k$ is computed as

$F\text{-to-enter}_k = \dfrac{(C - p^* - 1)\,V_k}{r_{yy} - V_k}$

with 1 and $C - p^* - 1$ degrees of freedom, where $p^*$ is the number of coefficients currently in the model and

$V_k = \dfrac{r_{yk} r_{ky}}{r_{kk}}$

The F-to-remove value for $X_k$ is computed as

$F\text{-to-remove}_k = \dfrac{(C - p^*)\,|V_k|}{r_{yy}}$

with 1 and $C - p^*$ degrees of freedom.
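The two F statistics can be sketched directly from swept-matrix elements. The function names and argument order are hypothetical; `c` is the sum of caseweights and `p_star` the number of coefficients currently in the model.

```python
def f_to_enter(r_yk, r_ky, r_kk, r_yy, c, p_star):
    """F-to-enter for a candidate variable not yet in the model."""
    v = r_yk * r_ky / r_kk
    return (c - p_star - 1) * v / (r_yy - v)

def f_to_remove(r_yk, r_ky, r_kk, r_yy, c, p_star):
    """F-to-remove for a variable currently in the model."""
    v = r_yk * r_ky / r_kk
    return (c - p_star) * abs(v) / r_yy
```

As a check on consistency, take one predictor with correlation 0.6 to Y, C = 30, and an intercept-only model (p* = 1): F-to-enter is 28 × 0.36 / 0.64 = 15.75. After that variable has been swept in (assuming the pivot row changes sign, so r_yk = 0.6, r_ky = −0.6, r_yy = 0.64, p* = 2), F-to-remove gives the same 15.75.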


Methods for Variable Entry and Removal 


Five methods for entry and removal of variables are available. The selection process is repeated 
until the maximum number of steps (MAXSTEP) is reached or no more independent variables 
qualify for entry or removal. The algorithms for these five methods are described in the following 
sections. 


Stepwise 


If there are independent variables currently entered in the model, choose $X_k$ such that $F\text{-to-remove}_k$ is minimum. $X_k$ is removed if $F\text{-to-remove}_k < F_{out}$ (default = 2.71) or, if probability criteria are used, $P(F\text{-to-remove}_k) > p_{out}$ (default = 0.1). If the inequality does not hold, no variable is removed from the model.

If there are no independent variables currently entered in the model or if no entered variable is to be removed, choose $X_k$ such that $F\text{-to-enter}_k$ is maximum. $X_k$ is entered if $F\text{-to-enter}_k > F_{in}$ (default = 3.84) or, if probability criteria are used, $P(F\text{-to-enter}_k) < p_{in}$ (default = 0.05). If the inequality does not hold, no variable is entered.


At each step, all eligible variables are considered for removal and entry.
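The decision made at each stepwise step can be sketched as follows, using the F-value criteria only. The function name `stepwise_step` and the dictionary inputs are hypothetical; computing the F values themselves is assumed to happen elsewhere.

```python
def stepwise_step(f_remove, f_enter, f_out=2.71, f_in=3.84):
    """Decide the next stepwise action.

    f_remove maps each variable in the model to its F-to-remove;
    f_enter maps each eligible variable outside the model to its
    F-to-enter. Returns ("remove", name), ("enter", name) or
    ("stop", None).
    """
    if f_remove:
        weakest = min(f_remove, key=f_remove.get)
        if f_remove[weakest] < f_out:
            return ("remove", weakest)
    if f_enter:
        strongest = max(f_enter, key=f_enter.get)
        if f_enter[strongest] > f_in:
            return ("enter", strongest)
    return ("stop", None)
```

Removal is checked before entry, so a variable that has become redundant is dropped before any new candidate is considered.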


Forward 


This procedure is the entry phase of the stepwise procedure. 


Backward 


This procedure is the removal phase of the stepwise procedure and can be used only after at least 
one independent variable has been entered in the model. 


Enter (Forced Entry) 


Choose $X_k$ such that $r_{kk}$ is maximum and enter $X_k$. Repeat for all variables to be entered.


Remove (Forced Removal) 


Choose $X_k$ such that $r_{kk}$ is minimum and remove $X_k$. Repeat for all variables to be removed.


Statistics 


The following statistics are available. 


Summary 


For the summary statistics, assume p independent variables are currently entered in the equation, 
of which a block of q variables have been entered or removed in the current step. 


Multiple R

$R = \sqrt{1 - r_{yy}}$


R Square

$R^2 = 1 - r_{yy}$


Adjusted R Square

$R^2_{adj} = R^2 - \dfrac{(1 - R^2)\,p}{C - p^*}$


R Square Change (when a block of q independent variables was added or removed)

$\Delta R^2 = R^2_{current} - R^2_{previous}$


F Change and Significance of F Change

$\Delta F = \begin{cases} \dfrac{\Delta R^2\,(C - p^*)}{q\,(1 - R^2_{current})} & \text{for the addition of q independent variables} \\ \dfrac{\Delta R^2\,(C - p^* - q)}{q\,(R^2_{previous} - 1)} & \text{for the removal of q independent variables} \end{cases}$

The degrees of freedom for the addition are q and $C - p^*$, while the degrees of freedom for the removal are q and $C - p^* - q$.

Residual Sum of Squares

$SS_e = r_{yy}\,(C - 1)\,S_{yy}$

with degrees of freedom $C - p^*$.

Sum of Squares Due to Regression

$SS_R = R^2\,(C - 1)\,S_{yy}$

with degrees of freedom p.
ANOVA Table


Table 87-2
ANOVA table

Analysis of Variance

              df           Sum of Squares    Mean Square
Regression    $p$          $SS_R$            $SS_R / p$
Residual      $C - p^*$    $SS_e$            $SS_e / (C - p^*)$


Standard Error of Estimate


Also known as the standard error of regression, this is simply the square root of the mean square residual from the ANOVA table, or $\sqrt{SS_e / (C - p^*)}$.


Variance-Covariance Matrix for Unstandardized Regression Coefficient Estimates 


A square matrix of size p with diagonal elements equal to the variance, the below diagonal elements equal to the covariance, and the above diagonal elements equal to the correlations:

$var(b_k) = \dfrac{r_{kk}\, r_{yy}\, S_{yy}}{S_{kk}\,(C - p^*)}$

$cov(b_k, b_j) = \dfrac{r_{kj}\, r_{yy}\, S_{yy}}{\sqrt{S_{kk} S_{jj}}\,(C - p^*)}$

$cor(b_k, b_j) = \dfrac{r_{kj}}{\sqrt{r_{kk}\, r_{jj}}}$



Selection Criteria 


The following selection criteria are available. 
Akaike Information Criterion (AIC)

$AIC = C \ln\left(\dfrac{SS_e}{C}\right) + 2p^*$


Amemiya’s Prediction Criterion (PC)

$PC = \dfrac{(1 - R^2)(C + p^*)}{C - p^*}$


Mallows’ Cp

$C_p = \dfrac{SS_e}{\hat{\sigma}^2} + 2p^* - C$

where $\hat{\sigma}^2$ is the mean square error from fitting the model that includes all variables specified or implied across all METHOD subcommands.


Schwarz Bayesian Criterion (SBC)

$SBC = C \ln\left(\dfrac{SS_e}{C}\right) + p^* \ln(C)$

Collinearity 


The following measures of collinearity are available.


Variance Inflation Factors

$VIF_i = \dfrac{1}{r_{ii}}$

Tolerance

$Tolerance_i = r_{ii}$
Eigenvalues 


The eigenvalues of scaled and uncentered cross-product matrix for the independent variables in 
the equation are computed by the QL method (Wilkinson and Reinsch, 1971). 


Condition Indices 
max Aj; 
Ik = —— 
MES Te 
Variance-Decomposition Proportions 


Let

$v_i = (v_{i1}, \ldots, v_{ip^*})$

be the eigenvector associated with eigenvalue $\lambda_i$. Also, let

$\Phi_{ij} = v_{ij}^2 / \lambda_i$ and $\Phi_j = \sum_{i=1}^{p^*} \Phi_{ij}$

The variance-decomposition proportion for the jth regression coefficient associated with the ith component is defined as

$\pi_{ij} = \Phi_{ij} / \Phi_j$


Statistics for Variables in the Equation 


The following statistics are computed for each variable in the equation. 


Regression Coefficient

$b_k = r_{yk}\sqrt{\dfrac{S_{yy}}{S_{kk}}}$ for $k = 1, \ldots, p$

The standard error of $b_k$ is computed as

$\hat{\sigma}_{b_k} = \sqrt{\dfrac{r_{kk}\, r_{yy}\, S_{yy}}{S_{kk}\,(C - p^*)}}$

95% confidence interval for coefficient

$b_k \pm \hat{\sigma}_{b_k}\, t_{0.975,\,C-p^*}$

If the model includes the intercept, the intercept is estimated as

$b_0 = \bar{Y} - \sum_{k=1}^{p} b_k \bar{X}_k$

The variance of $b_0$ is estimated by

$\hat{\sigma}^2_{b_0} = \dfrac{(C - 1)\, r_{yy}\, S_{yy}}{C\,(C - p^*)} + \sum_{k=1}^{p} \bar{X}_k^2\, \hat{\sigma}^2_{b_k} + 2\sum_{k=j+1}^{p}\sum_{j=1}^{p-1} \bar{X}_k \bar{X}_j\, est.cov(b_k, b_j)$


Beta Coefficients

$Beta_k = r_{yk}$

The standard error of $Beta_k$ is estimated by

$\hat{\sigma}_{Beta_k} = \sqrt{\dfrac{r_{yy}\, r_{kk}}{C - p^*}}$

F-test for $Beta_k$

$F = \left(\dfrac{Beta_k}{\hat{\sigma}_{Beta_k}}\right)^2$

with 1 and $C - p^*$ degrees of freedom.


Part Correlation

$Part\text{-}Corr(X_k) = \dfrac{r_{yk}}{\sqrt{r_{kk}}}$


Partial Correlation

$Partial\text{-}Corr(X_k) = \dfrac{r_{yk}}{\sqrt{r_{kk}\, r_{yy} - r_{yk}\, r_{ky}}}$


Statistics for Variables Not in the Equation


The following statistics are computed for each variable not in the equation. 


Standardized regression coefficient $Beta_k^*$ if predictor enters the equation at the next step

$Beta_k^* = \dfrac{r_{yk}}{r_{kk}}$

The F-test for $Beta_k^*$

$F = \dfrac{(C - p^* - 1)\, r_{yk}^2}{r_{kk}\, r_{yy} - r_{yk}^2}$

with 1 and $C - p^*$ degrees of freedom


Partial Correlation

$Partial(X_k) = \dfrac{r_{yk}}{\sqrt{r_{yy}\, r_{kk}}}$


Tolerance

$Tolerance_k = r_{kk}$


Minimum tolerance among variables already in the equation if predictor enters at the next step is

$\min_{1 \le j \le p}\left(\dfrac{1}{r_{jj} - (r_{kj}\, r_{jk}) / r_{kk}},\ r_{kk}\right)$


Residuals and Associated Statistics 


There are 19 temporary variables that can be added to the active system file. These variables can 
be requested with the RESIDUAL subcommand. 


Centered Leverage Values 


For all cases, compute 


= X;) ) (Xe — Xk) rjr 


Pp Pp x 
X ji 
oy Fe V 555 Skk 


if intercept is included 
hy = 


Pp 
end eee herw 
Cy otherwise 
Sij Skk 


j=l k=1 


For selected cases, leverage is /.,; for unselected case i with positive caseweight, leverage is 


eee ee [Ge + hi) /(1 + ge + hi) — 7H] _ if intercept is included 
hy] (1 + hg /g:) otherwise 


Unstandardized Predicted Values

$\hat{Y}_i = \begin{cases} \sum_{k=1}^{p} b_k X_{ki} & \text{if no intercept} \\ b_0 + \sum_{k=1}^{p} b_k X_{ki} & \text{otherwise} \end{cases}$


Unstandardized Residuals

$e_i = Y_i - \hat{Y}_i$

Standardized Residuals

$ZRESID_i = \begin{cases} e_i / s & \text{if no regression weight is specified} \\ \text{SYSMIS} & \text{otherwise} \end{cases}$

where s is the square root of the residual mean square.


Standardized Predicted Values

$ZPRED_i = \begin{cases} (\hat{Y}_i - \bar{Y}) / sd & \text{if no regression weight is specified} \\ \text{SYSMIS} & \text{otherwise} \end{cases}$

where sd is computed as

$sd = \sqrt{\dfrac{\sum_{i=1}^{l} c_i\,(\hat{Y}_i - \bar{Y})^2}{C - 1}}$


Studentized Residuals


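$SRES_i = \begin{cases} \dfrac{e_i}{s\sqrt{(1 - \tilde{h}_i)/g_i}} & \text{for selected cases with } c_i > 0 \\ \dfrac{e_i}{s\sqrt{(1 + \tilde{h}_i)/g_i}} & \text{otherwise} \end{cases}$


Deleted Residuals

$DRESID_i = \begin{cases} e_i / (1 - \tilde{h}_i) & \text{for selected cases with } c_i > 0 \\ e_i & \text{otherwise} \end{cases}$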


Studentized Deleted Residuals 


PRES ID for selected cases with c; > 0 
SDRESID;= 4 __“e __ otherwise 
(1+hi)/9: 


where 


DRESID? 


. — B*) 
S;/: 
(i) ~ asa" 1—h; 


Adjusted Predicted Values

$ADJPRED_i = Y_i - DRESID_i$


DfBeta

$DFBETA_i = b - b(i) = \dfrac{g_i\, e_i\, (X^t W X)^{-1} X_i^t}{1 - \tilde{h}_i}$

where

$X_i^t = \begin{cases} (1, X_{1i}, \ldots, X_{pi}) & \text{if intercept is included} \\ (X_{1i}, \ldots, X_{pi}) & \text{otherwise} \end{cases}$

and $W = diag(w_1, \ldots, w_l)$.

This is only computed for selected cases with case weight greater than or equal to 1.


Standardized DfBeta
$SDBETA_{ij} = \dfrac{b_j - b_j(i)}{s_{(i)}\sqrt{(X^t W X)^{-1}_{jj}}}$

where $b_j - b_j(i)$ is the jth component of $b - b(i)$, and $(X^t W X)^{-1}_{jj}$ is the jth diagonal element of $(X^t W X)^{-1}$.

This is only computed for selected cases with case weight greater than or equal to 1.


DfFit

$DFFIT_i = X_i\,[b - b(i)] = \dfrac{\tilde{h}_i\, e_i}{1 - \tilde{h}_i}$

This is only computed for selected cases with case weight greater than or equal to 1.

Standardized DfFit

$SDFIT_i = \dfrac{DFFIT_i}{s_{(i)}\sqrt{\tilde{h}_i}}$

This is only computed for selected cases with case weight greater than or equal to 1.

Covratio

$COVRATIO_i = \left(\dfrac{s_{(i)}}{s}\right)^{2p^*} \times \dfrac{1}{1 - \tilde{h}_i}$

This is only computed for selected cases with case weight greater than or equal to 1.


Mahalanobis Distance


For selected cases with $c_i > 0$

$MAHAL_i = \begin{cases} (C - 1)\,h_i & \text{if intercept is included} \\ C\,h_i & \text{otherwise} \end{cases}$

For unselected cases with $c_i > 0$

$MAHAL_i = \begin{cases} C\,h'_i & \text{if intercept is included} \\ (C + 1)\,h'_i & \text{otherwise} \end{cases}$

where $h'_i$ is the leverage for unselected case i.


Cook’s Distance (Cook, 1977)


For selected cases with c; > 0 
ean (DRESIDFhisi) / [st p+1)| if intercept is included 
(DRESID?higi)/(s7p) otherwise 


For unselected cases with c; > 0 


COOK, = (DRESID?(h; + ¢-))/[5?(p+1)] _ if intercept is included 
= oA. ~? 
(sp 


(DRESID?h ;) /(%p) otherwise 


where /: ; is the leverage for unselected case i, and s° is computed as 


u 


‘ ASS. +¢?(1—hi-4p)] _ if intercept is included 
coor [SSe + e7 (1 — hs) otherwise 


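Standard Errors of the Mean Predicted Values


For all the cases with positive caseweight,

$SEPRED_i = \begin{cases} s\sqrt{\tilde{h}_i / g_i} & \text{if intercept is included} \\ s\sqrt{h_i / g_i} & \text{otherwise} \end{cases}$

95% Confidence Interval for Mean Predicted Response

$LMCIN_i = \hat{Y}_i - t_{0.975,\,C-p^*}\, SEPRED_i$

$UMCIN_i = \hat{Y}_i + t_{0.975,\,C-p^*}\, SEPRED_i$


95% Confidence Interval for a Single Observation

$LICIN_i = \begin{cases} \hat{Y}_i - t_{0.975,\,C-p^*}\, s\sqrt{(\tilde{h}_i + 1)/g_i} & \text{if intercept is included} \\ \hat{Y}_i - t_{0.975,\,C-p^*}\, s\sqrt{(h_i + 1)/g_i} & \text{otherwise} \end{cases}$

$UICIN_i = \begin{cases} \hat{Y}_i + t_{0.975,\,C-p^*}\, s\sqrt{(\tilde{h}_i + 1)/g_i} & \text{if intercept is included} \\ \hat{Y}_i + t_{0.975,\,C-p^*}\, s\sqrt{(h_i + 1)/g_i} & \text{otherwise} \end{cases}$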
Durbin-Watson Statistic

$DW = \dfrac{\sum_{i=2}^{l} (\tilde{e}_i - \tilde{e}_{i-1})^2}{\sum_{i=1}^{l} c_i\, \tilde{e}_i^2}$

where $\tilde{e}_i = e_i\sqrt{g_i}$.


Note: the Durbin-Watson statistic cannot be computed if there are fractional case weights. Even 
with integer case weights, the formula is only valid if the case weights represent contiguous case 
replications in the original sample. 


Partial Residual Plots 


The scatterplots of the residuals of the dependent variable and an independent variable when 
both of these variables are regressed on the rest of the independent variables can be requested 
in the RESIDUAL branch. The algorithm for these residuals is described in (Velleman and 
Welsch, 1981). 


Missing Values 


By default, a case that has a missing value for any variable is deleted from the computation of the correlation matrix on which all subsequent computations are based. Users are allowed to change the treatment of cases with missing values.


RELIABILITY Algorithms


The RELIABILITY procedure employs one of two different computing methods, depending upon 
the MODEL specification and options and statistics requested. 


Method 1 does not involve computing a covariance matrix. It is faster than method 2 and, for large 
problems, requires much less workspace. However, it can compute coefficients only for ALPHA 
and SPLIT models, and it does not allow computation of a number of optional statistics, nor does 
it allow matrix input or output. Method 1 is used only when alpha or split models are requested 
and only FRIEDMAN, COCHRAN, DESCRIPTIVES, SCALE, and/or ANOVA are specified on 
the STATISTICS subcommand and/or TOTAL is specified on the SUMMARY subcommand. 


Method 2 requires computing a covariance matrix of the variables. It is slower than method 1 and 
requires more space. However, it can process all models, statistics, and options. 


The two methods differ in one other important respect. Method 1 will continue processing a scale 
containing variables with zero variance and leave them in the scale. Method 2 will delete variables 
with zero variance and continue processing if at least two variables remain in the scale. If item 
deletion is required, method 2 can be selected by requesting the covariance method. 


Notation 


There are N persons taking a test that consists of k items. A score $X_{ji}$ is given to the jth person on the ith item.

         1         2        ...    i        ...    k
1        X_11      X_12     ...             ...    X_1k      P_1
...
j                                  X_ji                      P_j
...
N        X_N1      X_N2     ...             ...    X_Nk      P_N
         T_1       T_2      ...    T_i      ...    T_k       G

If the model is SPLIT, $k_1$ items are in part 1 and $k_2 = k - k_1$ are in part 2. If the number of items in each part is not specified and k is even, the program sets $k_1 = k_2 = k/2$. If k is odd, $k_1 = (k + 1)/2$. It is assumed that the first $k_1$ items are in part 1.

Table 88-1
Notation

Notation    Description
$W = \sum_{j=1}^{N} w_j$    Sum of the weights, where $w_j$ is the weight for case j
$P_j = \sum_{i=1}^{k} X_{ji}$    The total score of the jth person
$\bar{P}_j = P_j / k$    Mean of the observations for the jth person
$T_i = \sum_{j=1}^{N} X_{ji} w_j$    The total score for the ith item
$G = \sum_{i=1}^{k}\sum_{j=1}^{N} X_{ji} w_j$    Grand sum of the scores
$\bar{G} = G / (Wk)$    Grand mean of the observations


Scale and Item Statistics—Method 1 


Item Means and Standard Deviations 


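Mean for the ith Item

$\bar{T}_i = T_i / W$


Standard Deviation for the ith Item

$S_i = \sqrt{\dfrac{\sum_{j=1}^{N} w_j X_{ji}^2 - W\bar{T}_i^2}{W - 1}}$


Scale Mean and Scale Variance

Scale Mean

$M = G / W$

For the split model:

Mean Part 1

$M_1 = \sum_{i=1}^{k_1} \bar{T}_i$

Mean Part 2

$M_2 = \sum_{i=k_1+1}^{k} \bar{T}_i$


Scale Variance

$S_p^2 = \dfrac{1}{W - 1}\left[\sum_{j=1}^{N} w_j P_j^2 - W M^2\right]$

For the split model:

Variance Part 1

$S_{p1}^2 = \dfrac{1}{W - 1}\left[\sum_{j=1}^{N} w_j\left(\sum_{i=1}^{k_1} X_{ji}\right)^2 - W\left(\sum_{i=1}^{k_1}\bar{T}_i\right)^2\right]$

Variance Part 2

$S_{p2}^2 = \dfrac{1}{W - 1}\left[\sum_{j=1}^{N} w_j\left(\sum_{i=k_1+1}^{k} X_{ji}\right)^2 - W\left(\sum_{i=k_1+1}^{k}\bar{T}_i\right)^2\right]$

Item-Total Statistics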


Scale Mean if the ith Item is Deleted

$\tilde{M}_i = M - \bar{T}_i$


Scale Variance if the ith Item is Deleted

$\tilde{S}_i^2 = S_p^2 + S_i^2 - 2\,cov(X_i, P)$

where the covariance between item i and the case score is

$cov(X_i, P) = \dfrac{1}{W - 1}\left(\sum_{j=1}^{N} P_j X_{ji} w_j - G\,\bar{T}_i\right)$

Alpha if the ith Item Deleted

$A_i = \dfrac{k - 1}{k - 2}\left(1 - \dfrac{\sum_{l \ne i} S_l^2}{\tilde{S}_i^2}\right)$


Correlation between the ith Item and Sum of Others

$R_i = \dfrac{cov(X_i, P) - S_i^2}{S_i\,\tilde{S}_i}$


The ANOVA Table (Winer, 1971) 


Table 88-2
ANOVA table

Source of variation    Sum of Squares    df
Between people    $\sum_{j=1}^{N} P_j^2 w_j / k - G^2/(Wk)$    $W - 1$
Within people    $\sum_{i=1}^{k}\sum_{j=1}^{N} w_j X_{ji}^2 - \sum_{j=1}^{N} P_j^2 w_j / k$    $W(k - 1)$
Between measures    $\sum_{i=1}^{k} T_i^2 / W - G^2/(Wk)$    $k - 1$
Residual    $\sum_{i=1}^{k}\sum_{j=1}^{N} w_j X_{ji}^2 - \sum_{j=1}^{N} P_j^2 w_j / k - \sum_{i=1}^{k} T_i^2 / W + G^2/(Wk)$    $(W - 1)(k - 1)$
Total    $\sum_{i=1}^{k}\sum_{j=1}^{N} w_j X_{ji}^2 - G^2/(Wk)$    $Wk - 1$


Each of the mean squares is obtained by dividing the sum of squares by the corresponding degrees of freedom. The F ratio for between measures is

$F = \dfrac{MS_{\text{between measures}}}{MS_{\text{residual}}}, \quad df = (k - 1,\ (W - 1)(k - 1))$

Friedman Test or Cochran Test

$\chi^2 = \dfrac{SS_{\text{between measures}}}{MS_{\text{within people}}}, \quad df = k - 1$

Note: Data must be ranks for the Friedman test and a dichotomy for the Cochran test.


Kendall’s Coefficient of Concordance

$W = \dfrac{SS_{\text{between measures}}}{SS_{\text{total}}}$


(Will not be printed if Cochran is also specified.) 


Tukey’s Test for Nonadditivity 


The residual sums of squares are further subdivided to

$SS_{nonadd} = M^2 / D, \quad df = 1$

where

$D = (SS_{\text{bet. meas}})(SS_{\text{bet. people}}) / (Wk)$

$M = \sum_{j=1}^{N} w_j\,(\bar{P}_j - \bar{G}) \sum_{i=1}^{k} X_{ji}\,(\bar{T}_i - \bar{G})$

$SS_{bal} = SS_{res} - SS_{nonadd}, \quad df = (W - 1)(k - 1) - 1$

The test for nonadditivity is

$F = \dfrac{MS_{nonadd}}{MS_{balance}}, \quad df = (1,\ (W - 1)(k - 1) - 1)$

The regression coefficient for the nonadditivity term is

$\hat{B} = M / D$

and the power to transform to additivity is

$\hat{p} = 1 - \hat{B}\,\bar{G}$


Scale Statistics 


Reliability coefficient alpha (Cronbach 1951)

$A = \dfrac{k}{k - 1}\left(1 - \dfrac{\sum_{i=1}^{k} S_i^2}{S_p^2}\right)$

For Split Model Only


Correlation Between the Two Parts of the Test

$R = \dfrac{\frac{1}{2}\left(S_p^2 - S_{p1}^2 - S_{p2}^2\right)}{S_{p1}\,S_{p2}}$


Equal Length Spearman-Brown Coefficient

$Y = \dfrac{2R}{1 + R}$


Guttman Split Half

$G = \dfrac{2\left(S_p^2 - S_{p1}^2 - S_{p2}^2\right)}{S_p^2}$


Unequal Length Spearman-Brown

$ULY = \dfrac{-R^2 + \sqrt{R^4 + 4R^2(1 - R^2)\,k_1 k_2 / k^2}}{2(1 - R^2)\,k_1 k_2 / k^2}$
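The scale statistics above reduce to a few arithmetic expressions once the item variances, scale variance and between-parts correlation are known. The sketch below is illustrative; the function names are hypothetical, and the inputs are assumed to have been computed as described in this section.

```python
import math

def cronbach_alpha(item_variances, scale_variance):
    """Alpha = k/(k-1) * (1 - sum of item variances / scale variance)."""
    k = len(item_variances)
    return k / (k - 1) * (1.0 - sum(item_variances) / scale_variance)

def spearman_brown_equal(r):
    """Equal-length Spearman-Brown coefficient from the parts correlation."""
    return 2.0 * r / (1.0 + r)

def spearman_brown_unequal(r, k1, k2):
    """Unequal-length Spearman-Brown coefficient.

    Assumes 0 < |r| < 1; r = 1 or r = -1 makes the denominator zero.
    """
    ratio = k1 * k2 / (k1 + k2) ** 2
    r2 = r * r
    root = math.sqrt(r2 * r2 + 4.0 * r2 * (1.0 - r2) * ratio)
    return (-r2 + root) / (2.0 * (1.0 - r2) * ratio)
```

When the two parts have the same number of items ($k_1 k_2 / k^2 = 1/4$) and R is positive, the unequal-length expression simplifies algebraically to $2R/(1+R)$, so the two functions agree.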
Basic Computations—Method 2


Items with zero variance are deleted from the scale and from $k$, $k_1$, and $k_2$. The inverses of matrices, when needed, are computed using the sweep operator described by Dempster (1969). If $|V| < 10^{-30}$, a warning is printed and statistics that require $V^{-1}$ are skipped.


Covariance Matrix V and Correlation Matrix R

$v_{ij} = \dfrac{1}{W - 1}\left(\sum_{l=1}^{N} X_{li} X_{lj} w_l - W\,\bar{T}_i \bar{T}_j\right)$

$S_p^2 = \sum_{i=1}^{k} S_i^2 + 2\sum_{i<j}^{k} v_{ij}$

$S_{p1}^2 = \sum_{i=1}^{k_1} S_i^2 + 2\sum_{i<j}^{k_1} v_{ij}$

$S_{p2}^2 = \sum_{i=k_1+1}^{k} S_i^2 + 2\sum_{i=k_1+1}^{k}\sum_{j>i}^{k} v_{ij}$

where the first $k_1$ items are in part 1.


Scale Statistics—Method 2 


Alpha Model 


Estimated Reliability 


a= 1,....4 ] if raw data input 


if correlation matrix and SD input 
Standardized Item Alpha

$\dfrac{k \cdot Corr}{1 + (k - 1) \cdot Corr}$


Split Model 


Correlation between Forms 


Guttman Split-Half 


ky k 
YY ay 
i=1 j=k,4+1 
5 
Os 


Alpha and Spearman-Brown equal and unequal length are computed as in method 1. 


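Guttman Model (Guttman 1945)

$L_1 = 1 - \dfrac{\sum_{i=1}^{k} S_i^2}{S_p^2}$

$L_2 = L_1 + \dfrac{\sqrt{\frac{k}{k-1}\sum_{i \ne j} v_{ij}^2}}{S_p^2}$

$L_3 = \dfrac{k}{k - 1}\,L_1$

$L_4 = \dfrac{4\sum_{i=1}^{k_1}\sum_{j=k_1+1}^{k} v_{ij}}{S_p^2}$

$L_5 = L_1 + \dfrac{2\sqrt{\max_i \sum_{j \ne i} v_{ij}^2}}{S_p^2}$

$L_6 = 1 - \dfrac{\sum_{i=1}^{k} \varepsilon_i^2}{S_p^2}$, where $\varepsilon_i^2 = \dfrac{1}{(V^{-1})_{ii}}$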


Parallel Model (Kristof 1963) 


Common Variance 


1 
CV =var = ae S? 
“G=1 
True Variance 
2 kk 
TV =tov= pap Vij 
k(k = 1) 4 <j 


Common Inter-Item Correlation 
R= cou / var 


Reliability of the Scale 
Unbiased Estimate of the Reliability


4 — 2+ (W-3)A 
“— (W=1) 


where A is defined above. 


Test for Goodness of Fit 


y=-(W-1)] 1 


where 
= IV 
Le (var—cov)*—'(var+(k-+1)cov) 
__ k(k+1) 
lf = a 2 


Log of the Determinant of the Unconstrained Matrix 


log UC = log |V | 


Log of the Determinant of the Constrained Matrix 


log C = log ((var _ cou)! (car +(k- 1)z0°) ) 


Strict Parallel (Kristof 1963) 


Common Variance 


Error Variance 


EV = MS\ithin people 


All mean squares are calculated as described in the analysis of variance table. 
True Variance


Unbiased Estimate of the Reliability 


3+(W — 3)Rel 


Rel = 
te ii 


Test for Goodness of Fit 


va or-n(1 D(k—1 


where 


WY 
pare 
cn 
pe 
ot 

t 
oe) 
— 
oe 
No 

| 


Vi 


L= 


k 
(var + (k — 1)cov) (ve — cov + i (T; — a 
=1 


i 


df =k(k +3)/2—3 


Log of the Determinant of the Unconstrained Matrix 


log UC = log 


a 
Log of the Determinant of the Constrained Matrix


k 


1 —_ 
log C = log (var + (k — pc (ear — 0 gue pan eee 
: 1 


t 


Additional Statistics—Method 2 


Descriptive and scale statistics and Tukey’s test are calculated as in method 1. Multiple R? if 


an item is deleted is calculated as 


2 Gj Oe A 
Re =1— gs 7 = 10", 


Analysis of Variance 


Table 
Table 88-3 
Analysis of variance table 
Source of | Sum of Squares df 
variation 
k kk 
Between . San 2 ; 
people (VV - |i xx 9; - EE) t eS vis W-1 
i =] < i<j 
Withi k 2 kk 
ithin W—1)(k—-1) 42 4 , nied : 
people k Ly ei a - + (W = 1)5Sbet, people W(k—1) 
i=1 t<j 
- k 1(& a 
etween > met al = 
measures m me a k (>. : ) ie 
W-1)(k-1 2 ie , 
Residual eo Db sf Al “ (W — 1)(k — 1) 
i=l a< j 
Total Between SS + Within SS Wk-1 


Hotelling’s T-Squared (Winer, 1971)

$T^2 = W\,\bar{Y}' B^{-1} \bar{Y}$

where

$\bar{Y} = \left(\bar{T}_1 - \bar{T}_k,\ \bar{T}_2 - \bar{T}_k,\ \ldots,\ \bar{T}_{k-1} - \bar{T}_k\right)'$

$B = C V C'$

where C is an identity matrix of rank $k - 1$ augmented with a column of $-1$ on the right.

$b_{ij} = v_{ij} - v_{ik} - v_{jk} + S_k^2$

The test will not be done if $W < k$ or $|B| < 10^{-30}$.

The significance level of $T^2$ is based on

$F = \dfrac{(W - k + 1)\,T^2}{(W - 1)(k - 1)}$, with $df = (k - 1,\ W - k + 1)$


Item Mean Summaries 


—_ k 
= (7) /k 
Variance=!_! Ck a 


Maximum = max 7’; 
i 

Minimum = min 7; 
u 


Range = Maximum — Minimum 


: Maximum 
Ratio = ———_——__ 
Minimum 


Item Variance Summaries 


Same as for item means except that S? is substituted for 7’; in all calculations. 


Inter-ltem Covariance Summaries 


t<j 
M = 
ean ik 1) 
1 kk 1 kk a 
Variance=———_——__ | ©=| ©. v, ; —- ———— | © dD y;; 
k(k-—1)-1 i<j J k(k -—1) ee J 


Maximum = $\max_{i<j} v_{ij}$

Minimum = $\min_{i<j} v_{ij}$

Range = Maximum — Minimum 


; Maximum 
Ratio = ———_——__ 
Minimum 


Inter-Item Correlations 
Same as for inter-item covariances, with v;; being replaced by r;;. 


If the model is split, statistics are also calculated separately for each scale. 


Intraclass Correlation Coefficients 


Intraclass correlation coefficients are always discussed in a random/mixed effects model setting. 
McGraw and Wong (1996) is the key reference for this document. See also Shrout and Fleiss 
(1979). 


In this document, two measures of correlation are given for each type under each model: single 
measure and average measure. Single measure applies to single measurements, for example, the 
ratings of judges, individual item scores, or the body weights of individuals, whereas average 
measure applies to average measurements, for example, the average rating for k judges, or the 
average score for a k-item test. 


One-Way Random Effects Model: People Effect Random 


Let $X_{ji}$ be the response to the ith measure given by the jth person, $i = 1, \ldots, k$, $j = 1, \ldots, W$. Suppose that $X_{ji}$ can be expressed as $X_{ji} = \mu + p_j + w_{ji}$, where $p_j$ is the between-people effect, which is normally distributed with zero mean and a variance of $\sigma_p^2$, and $w_{ji}$ is the within-people effect, which is also normally distributed with zero mean and a variance of $\sigma_w^2$.


Let $MS_{BP}$ and $MS_{WP}$ be the respective between-people Mean Squares and within-people Mean Squares. These two quantities can be computed by dividing the corresponding Sum of Squares by its degrees of freedom. For more information, see the topic “Analysis of Variance Table”.


Single Measure Intraclass Correlation 
The single measure intraclass correlation is defined as 
Pl) = 79 


On + Oy 


Estimate 
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The single measure intraclass correlation coefficient is estimated by 


S: —MS: 
ICC (1) = “SBP wp 


In general, 
—1 is ry _ 


Confidence Interval 


For $0 < \alpha < 1$, a $(1-\alpha)100\%$ confidence interval for $\rho_{(1)}$ is given by

$$\frac{F_{p/w} - F_{\alpha/2,\,W-1,\,W(k-1)}}{F_{p/w} + (k-1)\,F_{\alpha/2,\,W-1,\,W(k-1)}} < \rho_{(1)} < \frac{F_{p/w} - F_{1-\alpha/2,\,W-1,\,W(k-1)}}{F_{p/w} + (k-1)\,F_{1-\alpha/2,\,W-1,\,W(k-1)}},$$

where

$$F_{p/w} = \frac{MS_{BP}}{MS_{WP}}$$

and $F_{\alpha',\,v_1,\,v_2}$ is the upper $\alpha'$ point of an F-distribution with degrees of freedom $v_1$ and $v_2$.

Hypothesis Testing 
The test statistic $F^{(1)}$ for $H_0: \rho_{(1)} = \rho_0$, where $1 > \rho_0 > 0$ is the hypothesized value, is

$$F^{(1)} = F_{p/w}\,\frac{1-\rho_0}{1+(k-1)\rho_0}.$$

Under the null hypothesis, the test statistic has an F-distribution with $W-1$, $W(k-1)$ degrees of
freedom.


Average Measure Intraclass Correlation 


The average measure intraclass correlation is defined as

$$\rho_{(k)} = \frac{\sigma_p^2}{\sigma_p^2 + \sigma_w^2/k}.$$


Estimate


The average measure intraclass correlation coefficient is estimated by

$$ICC(k) = \frac{MS_{BP} - MS_{WP}}{MS_{BP}}.$$

Confidence Interval 


A $(1-\alpha)100\%$ confidence interval for $\rho_{(k)}$ is given by

$$\frac{F_{p/w} - F_{\alpha/2,\,W-1,\,W(k-1)}}{F_{p/w}} < \rho_{(k)} < \frac{F_{p/w} - F_{1-\alpha/2,\,W-1,\,W(k-1)}}{F_{p/w}}.$$


Hypothesis Testing
The test statistic $F^{(k)}$ for $H_0: \rho_{(k)} = \rho_0$, where $1 > \rho_0 > 0$ is the hypothesized value, is

$$F^{(k)} = F_{p/w}\,(1-\rho_0).$$

Under the null hypothesis, the test statistic has an F-distribution with $W-1$, $W(k-1)$ degrees of
freedom.
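
The one-way estimates above can be illustrated with a short Python sketch. It is not the procedure's actual implementation: the function names, the dictionary keys, and the small 6-person by 4-measure ratings table are assumptions made for illustration. The sketch computes the two mean squares from raw data and then applies the estimators and the single-measure test statistic as written above.

```python
def one_way_icc(ratings):
    """ratings[j][i] is the response of person j on measure i (W persons, k measures)."""
    W = len(ratings)
    k = len(ratings[0])
    grand_mean = sum(sum(row) for row in ratings) / (W * k)
    person_means = [sum(row) / k for row in ratings]

    # Between-people and within-people sums of squares.
    ss_bp = k * sum((m - grand_mean) ** 2 for m in person_means)
    ss_wp = sum((x - m) ** 2 for row, m in zip(ratings, person_means) for x in row)

    # Mean squares: sum of squares divided by its degrees of freedom.
    ms_bp = ss_bp / (W - 1)
    ms_wp = ss_wp / (W * (k - 1))

    return {
        "MS_BP": ms_bp,
        "MS_WP": ms_wp,
        "F_pw": ms_bp / ms_wp,
        "ICC(1)": (ms_bp - ms_wp) / (ms_bp + (k - 1) * ms_wp),
        "ICC(k)": (ms_bp - ms_wp) / ms_bp,
    }


def f_stat_single(f_pw, k, rho0):
    """Test statistic F(1) for H0: rho(1) = rho0."""
    return f_pw * (1 - rho0) / (1 + (k - 1) * rho0)


# Illustrative data only: 6 persons (rows) rated on 4 measures (columns).
ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
result = one_way_icc(ratings)
```

Confidence limits additionally require quantiles of the F-distribution, which the standard library does not provide, so they are left out of the sketch.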


Two-Way Random Effects Model: People and Measures Effects Random 


Let $X_{ji}$ be the response to the i-th measure given by the j-th person, $i = 1, \ldots, k$, $j = 1, \ldots, W$.
Suppose that $X_{ji}$ can be expressed as $X_{ji} = \mu + p_j + m_i + pm_{ji} + e_{ji}$, where $p_j$ is the people effect,
which is normally distributed with zero mean and a variance of $\sigma_p^2$; $m_i$ is the measures effect, which
is normally distributed with zero mean and a variance of $\sigma_m^2$; $pm_{ji}$ is the interaction effect, which
is normally distributed with zero mean and a variance of $\sigma_{pm}^2$; and $e_{ji}$ is the error effect, which is
again normally distributed with zero mean and a variance of $\sigma_e^2$.


Let $MS_{BP}$, $MS_{BM}$, and $MS_{Res}$ be the between-people Mean Squares,
between-measures Mean Squares, and Residual Mean Squares, respectively. These quantities can be computed
by dividing the corresponding Sum of Squares by its degrees of freedom. For more information,
see the topic “Analysis of Variance Table”.


Type A Single Measure Intraclass Correlation 
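The type A single measure intraclass correlation is defined as

$$\rho_{(A,1,r)} = \begin{cases} \dfrac{\sigma_p^2}{\sigma_p^2 + \sigma_m^2 + \sigma_{pm}^2 + \sigma_e^2} & \text{if interaction effect } pm_{ji} \text{ is present} \\[2ex] \dfrac{\sigma_p^2}{\sigma_p^2 + \sigma_m^2 + \sigma_e^2} & \text{if interaction effect } pm_{ji} \text{ is absent} \end{cases}$$

Estimate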


The type A single measure intraclass correlation coefficient is estimated by

$$ICC(A,1,r) = \frac{MS_{BP} - MS_{Res}}{MS_{BP} + (k-1)\,MS_{Res} + k\,(MS_{BM} - MS_{Res})/W}.$$

Notice that the same estimator is used whether or not the interaction effect $pm_{ji}$ is present.
Confidence Interval


A $(1-\alpha)100\%$ confidence interval is given by

$$\frac{W\left(MS_{BP} - F_{\alpha/2,\,W-1,\,v}\,MS_{Res}\right)}{F_{\alpha/2,\,W-1,\,v}\left[k\,MS_{BM} + (kW-k-W)\,MS_{Res}\right] + W\,MS_{BP}} < \rho_{(A,1,r)} < \frac{W\left(MS_{BP} - F_{1-\alpha/2,\,W-1,\,v}\,MS_{Res}\right)}{F_{1-\alpha/2,\,W-1,\,v}\left[k\,MS_{BM} + (kW-k-W)\,MS_{Res}\right] + W\,MS_{BP}},$$

where

$$v = \frac{\left(a\,MS_{BM} + b\,MS_{Res}\right)^2}{\dfrac{(a\,MS_{BM})^2}{k-1} + \dfrac{(b\,MS_{Res})^2}{(W-1)(k-1)}},$$

$$a = \frac{k \cdot ICC(A,1,r)}{W\left(1 - ICC(A,1,r)\right)}$$

and

$$b = 1 + \frac{k \cdot ICC(A,1,r) \cdot (W-1)}{W\left(1 - ICC(A,1,r)\right)}.$$
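
As a rough numerical illustration of the degrees of freedom $v$ used in this interval, the following Python sketch evaluates $a$, $b$, and $v$ from three mean squares. The function name and the input values are hypothetical. A useful sanity check, which follows algebraically from the definitions, is that $a\,MS_{BM} + b\,MS_{Res}$ equals $MS_{BP}$ when $ICC(A,1,r)$ is the estimate computed from the same mean squares.

```python
def type_a_single_df(ms_bp, ms_bm, ms_res, k, W):
    """Return (ICC(A,1,r), a, b, v) for the type A single measure interval."""
    icc = (ms_bp - ms_res) / (ms_bp + (k - 1) * ms_res + k * (ms_bm - ms_res) / W)
    a = k * icc / (W * (1 - icc))
    b = 1 + k * icc * (W - 1) / (W * (1 - icc))
    numerator = (a * ms_bm + b * ms_res) ** 2
    denominator = (a * ms_bm) ** 2 / (k - 1) + (b * ms_res) ** 2 / ((W - 1) * (k - 1))
    return icc, a, b, numerator / denominator


# Hypothetical mean squares for k = 4 measures and W = 6 people.
icc, a, b, v = type_a_single_df(11.2417, 32.4861, 1.0194, k=4, W=6)
```

Because $v$ is generally not an integer, any routine used for the F quantiles has to accept fractional degrees of freedom.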


Hypothesis Testing 
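The test statistic $F^{(A,1,r)}$ for $H_0: \rho_{(A,1,r)} = \rho_0$, where $1 > \rho_0 > 0$ is the hypothesized value, is

$$F^{(A,1,r)} = \frac{MS_{BP}}{a_0\,MS_{BM} + b_0\,MS_{Res}},$$

where

$$a_0 = \frac{k\rho_0}{W(1-\rho_0)}$$

and

$$b_0 = 1 + \frac{k\rho_0(W-1)}{W(1-\rho_0)}.$$

Under the null hypothesis, the test statistic has an F-distribution with $W-1$, $v_0^{(1)}$ degrees of
freedom, where

$$v_0^{(1)} = \frac{\left(a_0\,MS_{BM} + b_0\,MS_{Res}\right)^2}{\dfrac{(a_0\,MS_{BM})^2}{k-1} + \dfrac{(b_0\,MS_{Res})^2}{(W-1)(k-1)}}.$$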

Type A Average Measure Intraclass Correlation 


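The type A average measure intraclass correlation is defined as

$$\rho_{(A,k,r)} = \begin{cases} \dfrac{\sigma_p^2}{\sigma_p^2 + (\sigma_m^2 + \sigma_{pm}^2 + \sigma_e^2)/k} & \text{if interaction effect } pm_{ji} \text{ is present} \\[2ex] \dfrac{\sigma_p^2}{\sigma_p^2 + (\sigma_m^2 + \sigma_e^2)/k} & \text{if interaction effect } pm_{ji} \text{ is absent} \end{cases}$$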
Estimate 
The type A average measure intraclass correlation coefficient is estimated by

$$ICC(A,k,r) = \frac{MS_{BP} - MS_{Res}}{MS_{BP} + (MS_{BM} - MS_{Res})/W}.$$


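Notice that the same estimator is used whether or not the interaction effect $pm_{ji}$ is present.
Confidence Interval


A $(1-\alpha)100\%$ confidence interval is given by

$$\frac{W\left(MS_{BP} - F_{\alpha/2,\,W-1,\,v}\,MS_{Res}\right)}{F_{\alpha/2,\,W-1,\,v}\left(MS_{BM} - MS_{Res}\right) + W\,MS_{BP}} < \rho_{(A,k,r)} < \frac{W\left(MS_{BP} - F_{1-\alpha/2,\,W-1,\,v}\,MS_{Res}\right)}{F_{1-\alpha/2,\,W-1,\,v}\left(MS_{BM} - MS_{Res}\right) + W\,MS_{BP}},$$

where

$$v = \frac{\left(c\,MS_{BM} + d\,MS_{Res}\right)^2}{\dfrac{(c\,MS_{BM})^2}{k-1} + \dfrac{(d\,MS_{Res})^2}{(W-1)(k-1)}},$$

$$c = \frac{ICC(A,k,r)}{W\left(1 - ICC(A,k,r)\right)}$$

and

$$d = 1 + \frac{ICC(A,k,r) \cdot (W-1)}{W\left(1 - ICC(A,k,r)\right)}.$$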


Hypothesis Testing 
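The test statistic $F^{(A,k,r)}$ for $H_0: \rho_{(A,k,r)} = \rho_0$, where $1 > \rho_0 > 0$ is the hypothesized value, is

$$F^{(A,k,r)} = \frac{MS_{BP}}{c_0\,MS_{BM} + d_0\,MS_{Res}},$$

where

$$c_0 = \frac{\rho_0}{W(1-\rho_0)}$$

and

$$d_0 = 1 + \frac{\rho_0(W-1)}{W(1-\rho_0)}.$$

Under the null hypothesis, the test statistic has an F-distribution with $W-1$, $v_0^{(k)}$ degrees of
freedom, where

$$v_0^{(k)} = \frac{\left(c_0\,MS_{BM} + d_0\,MS_{Res}\right)^2}{\dfrac{(c_0\,MS_{BM})^2}{k-1} + \dfrac{(d_0\,MS_{Res})^2}{(W-1)(k-1)}}.$$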


Type C Single Measure Intraclass Correlation 


The type C single measure intraclass correlation is defined as

$$\rho_{(C,1,r)} = \begin{cases} \dfrac{\sigma_p^2}{\sigma_p^2 + \sigma_{pm}^2 + \sigma_e^2} & \text{if interaction effect } pm_{ji} \text{ is present} \\[2ex] \dfrac{\sigma_p^2}{\sigma_p^2 + \sigma_e^2} & \text{if interaction effect } pm_{ji} \text{ is absent} \end{cases}$$


Estimate


The type C single measure intraclass correlation coefficient is estimated by

$$ICC(C,1,r) = \frac{MS_{BP} - MS_{Res}}{MS_{BP} + (k-1)\,MS_{Res}}.$$

Notice that the same estimator is used whether or not the interaction effect $pm_{ji}$ is present.
Confidence Interval 


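A $(1-\alpha)100\%$ confidence interval is given by

$$\frac{F_{p/r} - F_{\alpha/2,\,W-1,\,(W-1)(k-1)}}{F_{p/r} + (k-1)\,F_{\alpha/2,\,W-1,\,(W-1)(k-1)}} < \rho_{(C,1,r)} < \frac{F_{p/r} - F_{1-\alpha/2,\,W-1,\,(W-1)(k-1)}}{F_{p/r} + (k-1)\,F_{1-\alpha/2,\,W-1,\,(W-1)(k-1)}},$$

where

$$F_{p/r} = \frac{MS_{BP}}{MS_{Res}}.$$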
Hypothesis Testing 
The test statistic for $H_0: \rho_{(C,1,r)} = \rho_0$, where $1 > \rho_0 > 0$ is the hypothesized value, is

$$F^{(C,1,r)} = F_{p/r}\,\frac{1-\rho_0}{1+(k-1)\rho_0}.$$

Under the null hypothesis, the test statistic has an F-distribution with
$W-1$, $(W-1)(k-1)$ degrees of freedom.


Type C Average Measure Intraclass Correlation 


The type C average measure intraclass correlation is defined as

$$\rho_{(C,k,r)} = \begin{cases} \dfrac{\sigma_p^2}{\sigma_p^2 + (\sigma_{pm}^2 + \sigma_e^2)/k} & \text{if interaction effect } pm_{ji} \text{ is present} \\[2ex] \dfrac{\sigma_p^2}{\sigma_p^2 + \sigma_e^2/k} & \text{if interaction effect } pm_{ji} \text{ is absent} \end{cases}$$

Estimate


The type C average measure intraclass correlation coefficient is estimated by

$$ICC(C,k,r) = \frac{MS_{BP} - MS_{Res}}{MS_{BP}}.$$

Notice that the same estimator is used whether or not the interaction effect $pm_{ji}$ is present.


Confidence Interval


A $(1-\alpha)100\%$ confidence interval is given by

$$\frac{F_{p/r} - F_{\alpha/2,\,W-1,\,(W-1)(k-1)}}{F_{p/r}} < \rho_{(C,k,r)} < \frac{F_{p/r} - F_{1-\alpha/2,\,W-1,\,(W-1)(k-1)}}{F_{p/r}}.$$

Hypothesis Testing 
The test statistic for $H_0: \rho_{(C,k,r)} = \rho_0$, where $1 > \rho_0 > 0$ is the hypothesized value, is

$$F^{(C,k,r)} = F_{p/r}\,(1-\rho_0).$$

Under the null hypothesis, the test statistic has an F-distribution with
$W-1$, $(W-1)(k-1)$ degrees of freedom.
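
The four two-way estimators share the same three mean squares, which a brief Python sketch can make concrete. This is only an illustration under stated assumptions: the function names, the dictionary keys, and the 6 by 4 ratings table are hypothetical, and the residual sum of squares is obtained by subtraction from the total.

```python
def two_way_mean_squares(ratings):
    """ratings[j][i]: person j (row), measure i (column). Returns MS_BP, MS_BM, MS_Res."""
    W, k = len(ratings), len(ratings[0])
    grand = sum(map(sum, ratings)) / (W * k)
    person_means = [sum(row) / k for row in ratings]
    measure_means = [sum(row[i] for row in ratings) / W for i in range(k)]

    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_bp = k * sum((m - grand) ** 2 for m in person_means)
    ss_bm = W * sum((m - grand) ** 2 for m in measure_means)
    ss_res = ss_total - ss_bp - ss_bm  # residual by subtraction

    return ss_bp / (W - 1), ss_bm / (k - 1), ss_res / ((W - 1) * (k - 1))


def two_way_icc(ratings):
    """Type A and type C estimates, single and average measure."""
    W, k = len(ratings), len(ratings[0])
    ms_bp, ms_bm, ms_res = two_way_mean_squares(ratings)
    diff = ms_bp - ms_res
    return {
        "ICC(A,1)": diff / (ms_bp + (k - 1) * ms_res + k * (ms_bm - ms_res) / W),
        "ICC(A,k)": diff / (ms_bp + (ms_bm - ms_res) / W),
        "ICC(C,1)": diff / (ms_bp + (k - 1) * ms_res),
        "ICC(C,k)": diff / ms_bp,
    }


# Illustrative data only: 6 persons rated on 4 measures.
ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
icc = two_way_icc(ratings)
```

In this example the type A values are smaller than the type C values because the between-measures mean square is large relative to the residual mean square.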


Two-Way Mixed Effects Model: People Effects Random, Measures Effects Fixed 


Let $X_{ji}$ be the response to the i-th measure given by the j-th person, $i = 1, \ldots, k$, $j = 1, \ldots, W$.
Suppose that $X_{ji}$ can be expressed as $X_{ji} = \mu + p_j + m_i + pm_{ji} + e_{ji}$, where $p_j$ is the people effect,
which is normally distributed with zero mean and a variance of $\sigma_p^2$; $m_i$ is considered as a fixed effect;
$pm_{ji}$ is the interaction effect, which is normally distributed with zero mean and a variance of $\sigma_{pm}^2$;
and $e_{ji}$ is the error effect, which is again normally distributed with zero mean and a variance of $\sigma_e^2$.
Denote $\theta_m^2$ as the expected measure square of between measures effect $m_i$.


Let $MS_{BP}$ and $MS_{Res}$ be the between-people Mean Squares and Residual Mean
Squares, respectively. These quantities can be computed by dividing the corresponding Sum of Squares by its
degrees of freedom. For more information, see the topic “Analysis of Variance Table”.


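Type A Single Measure Intraclass Correlation


The type A single measure intraclass correlation is defined as

$$\rho_{(A,1,m)} = \begin{cases} \dfrac{\sigma_p^2 - \sigma_{pm}^2/(k-1)}{\sigma_p^2 + \theta_m^2 + \sigma_{pm}^2 + \sigma_e^2} & \text{if interaction effect } pm_{ji} \text{ is present} \\[2ex] \dfrac{\sigma_p^2}{\sigma_p^2 + \theta_m^2 + \sigma_e^2} & \text{if interaction effect } pm_{ji} \text{ is absent} \end{cases}$$

Estimate


The type A single measure intraclass correlation is estimated by

$$ICC(A,1,m) = \frac{MS_{BP} - MS_{Res}}{MS_{BP} + (k-1)\,MS_{Res} + k\,(MS_{BM} - MS_{Res})/W}.$$

Notice that the same estimator is used whether or not the interaction effect $pm_{ji}$ is present.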


Confidence Interval 


A $(1-\alpha)100\%$ confidence interval for $\rho_{(A,1,m)}$ is the same as that for $\rho_{(A,1,r)}$, with
$ICC(A,1,r)$ replaced by $ICC(A,1,m)$.


Hypothesis Testing 


The test statistic for $H_0: \rho_{(A,1,m)} = \rho_0$, where $1 > \rho_0 > 0$ is the hypothesized value, is the same
as that for $\rho_{(A,1,r)}$, with the same distribution under the null hypothesis.


Type A Average Measure Intraclass Correlation


The type A average measure intraclass correlation is defined as

$$\rho_{(A,k,m)} = \begin{cases} \dfrac{\sigma_p^2 - \sigma_{pm}^2/(k-1)}{\sigma_p^2 + (\theta_m^2 + \sigma_{pm}^2 + \sigma_e^2)/k} & \text{if interaction effect } pm_{ji} \text{ is present} \\[2ex] \dfrac{\sigma_p^2}{\sigma_p^2 + (\theta_m^2 + \sigma_e^2)/k} & \text{if interaction effect } pm_{ji} \text{ is absent} \end{cases}$$


Estimate


The type A average measure intraclass correlation is estimated by

$$ICC(A,k,m) = \begin{cases} \text{Not estimable} & \text{if interaction effect } pm_{ji} \text{ is present} \\[1ex] \dfrac{MS_{BP} - MS_{Res}}{MS_{BP} + (MS_{BM} - MS_{Res})/W} & \text{if interaction effect } pm_{ji} \text{ is absent} \end{cases}$$


Confidence Interval 


A $(1-\alpha)100\%$ confidence interval for $\rho_{(A,k,m)}$ is the same as that for $\rho_{(A,k,r)}$, with
$ICC(A,k,r)$ replaced by $ICC(A,k,m)$. Notice that the confidence interval is not available when
the interaction effect $pm_{ji}$ is present.


Hypothesis Testing 


The test statistic for $H_0: \rho_{(A,k,m)} = \rho_0$, where $1 > \rho_0 > 0$ is the hypothesized value, is the same
as that for $\rho_{(A,k,r)}$, with the same distribution under the null hypothesis. Notice that the hypothesis
test is not available when the interaction effect $pm_{ji}$ is present.


Type C Single Measure Intraclass Correlation 
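The type C single measure intraclass correlation is defined as

$$\rho_{(C,1,m)} = \begin{cases} \dfrac{\sigma_p^2 - \sigma_{pm}^2/(k-1)}{\sigma_p^2 + \sigma_{pm}^2 + \sigma_e^2} & \text{if interaction effect } pm_{ji} \text{ is present} \\[2ex] \dfrac{\sigma_p^2}{\sigma_p^2 + \sigma_e^2} & \text{if interaction effect } pm_{ji} \text{ is absent} \end{cases}$$


Estimate


The type C single measure intraclass correlation is estimated by

$$ICC(C,1,m) = \frac{MS_{BP} - MS_{Res}}{MS_{BP} + (k-1)\,MS_{Res}}.$$

Notice that the same estimator is used whether or not the interaction effect $pm_{ji}$ is present.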
Confidence Interval


A $(1-\alpha)100\%$ confidence interval is given by

$$\frac{F_{p/r} - F_{\alpha/2,\,W-1,\,(W-1)(k-1)}}{F_{p/r} + (k-1)\,F_{\alpha/2,\,W-1,\,(W-1)(k-1)}} < \rho_{(C,1,m)} < \frac{F_{p/r} - F_{1-\alpha/2,\,W-1,\,(W-1)(k-1)}}{F_{p/r} + (k-1)\,F_{1-\alpha/2,\,W-1,\,(W-1)(k-1)}},$$

where

$$F_{p/r} = \frac{MS_{BP}}{MS_{Res}}.$$
Hypothesis Testing 
The test statistic for $H_0: \rho_{(C,1,m)} = \rho_0$, where $1 > \rho_0 > 0$ is the hypothesized value, is

$$F^{(C,1,m)} = F_{p/r}\,\frac{1-\rho_0}{1+(k-1)\rho_0}.$$

Under the null hypothesis, the test statistic has an F-distribution with
$W-1$, $(W-1)(k-1)$ degrees of freedom.


Type C Average Measure Intraclass Correlation 
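The type C average measure intraclass correlation is defined as

$$\rho_{(C,k,m)} = \begin{cases} \dfrac{\sigma_p^2 - \sigma_{pm}^2/(k-1)}{\sigma_p^2 + (\sigma_{pm}^2 + \sigma_e^2)/k} & \text{if interaction effect } pm_{ji} \text{ is present} \\[2ex] \dfrac{\sigma_p^2}{\sigma_p^2 + \sigma_e^2/k} & \text{if interaction effect } pm_{ji} \text{ is absent} \end{cases}$$

Estimate


The type C average measure intraclass correlation is estimated by

$$ICC(C,k,m) = \begin{cases} \text{Not estimable} & \text{if interaction effect } pm_{ji} \text{ is present} \\[1ex] \dfrac{MS_{BP} - MS_{Res}}{MS_{BP}} & \text{if interaction effect } pm_{ji} \text{ is absent} \end{cases}$$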


Confidence Interval 


A $(1-\alpha)100\%$ confidence interval is given by

$$\frac{F_{p/r} - F_{\alpha/2,\,W-1,\,(W-1)(k-1)}}{F_{p/r}} < \rho_{(C,k,m)} < \frac{F_{p/r} - F_{1-\alpha/2,\,W-1,\,(W-1)(k-1)}}{F_{p/r}}.$$

Notice that the confidence interval is not available when the interaction effect $pm_{ji}$ is present.
Hypothesis Testing 
The test statistic for $H_0: \rho_{(C,k,m)} = \rho_0$, where $1 > \rho_0 > 0$ is the hypothesized value, is

$$F^{(C,k,m)} = F_{p/r}\,(1-\rho_0).$$

Under the null hypothesis, the test statistic has an F-distribution with
$W-1$, $(W-1)(k-1)$ degrees of freedom. Notice that the F-test is not available
when the interaction effect $pm_{ji}$ is present.


Missing values in a time series are estimated.


Notation 

The following notation is used throughout this section unless otherwise stated: 
Table 89-1
Notation

Notation                        Description
$X = (X_1, \ldots, X_n)$        Original series
$\hat{X}$                       Estimate for spans
$p$                             Number of spans
$k$                             The number of consecutive missing values
$X_i$ to $X_{i+k-1}$            Set of consecutive missing values


Methods for Estimating Missing Values 


The following methods are available. 


Linear Interpolation (LINT(X)) 


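$$\hat{X}_{i+l} = \begin{cases} X_{i-1} + \dfrac{l+1}{k+1}\left(X_{i+k} - X_{i-1}\right), & l = 0, \ldots, k-1 \\[1ex] \text{SYSMIS}, & i = 1 \text{ or } i+k-1 = n \end{cases}$$

If $k = 1$ (that is, only one consecutive missing observation), then

$$\hat{X}_i = \begin{cases} \tfrac{1}{2}\left(X_{i-1} + X_{i+1}\right), & i = 2, \ldots, n-1 \\[1ex] \text{SYSMIS}, & i = 1 \text{ or } i = n \end{cases}$$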
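
A minimal Python sketch of the interpolation rule, assuming missing values are represented by `None` (standing in for SYSMIS) and that the series is held in a 0-based list. The function name `lint` mirrors the keyword but is otherwise hypothetical.

```python
def lint(series):
    """Linear interpolation over runs of missing values; None marks SYSMIS."""
    n = len(series)
    out = list(series)
    i = 0
    while i < n:
        if series[i] is not None:
            i += 1
            continue
        # Find the end of the run of k consecutive missing values starting at i.
        j = i
        while j < n and series[j] is None:
            j += 1
        k = j - i
        # Interpolate only when valid neighbours exist on both sides;
        # runs touching either end of the series stay missing.
        if i > 0 and j < n:
            left, right = series[i - 1], series[j]
            for l in range(k):
                out[i + l] = left + (l + 1) / (k + 1) * (right - left)
        i = j
    return out


filled = lint([1.0, None, None, 4.0, None])
```

Here the two interior gaps are filled with values on the straight line from 1 to 4, while the trailing gap has no following observation and remains missing.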


Mean of p Nearest Preceding and p Subsequent Values (MEAN (X,p))
If the number of nonmissing observations in $(X_1, \ldots, X_{i-1})$ or $(X_{i+k}, \ldots, X_n)$ is less than $p$,
then set $\hat{X}_{i+l}$ = SYSMIS; otherwise, set $\hat{X}_{i+l}$ = average of $p$ nonmissing observations preceding
$X_i$ and $p$ nonmissing observations following $X_{i+k-1}$.


Median of p Nearest Preceding and p Subsequent Values (MEDIAN (X,p))


If the number of nonmissing observations in $(X_1, \ldots, X_{i-1})$ or $(X_{i+k}, \ldots, X_n)$ is less than $p$, then
set $\hat{X}_{i+l}$ = SYSMIS; otherwise, set $\hat{X}_{i+l}$ = median of $p$ nonmissing observations preceding $X_i$ and
$p$ nonmissing observations following $X_{i+k-1}$.


Series Mean (SMEAN (X))


$\hat{X}_{i+l}$ = average of all nonmissing observations in the series.


Linear Trend (TREND(X)) 


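1. Use all the nonmissing observations in the series to fit the regression line of the form

$$\hat{X}_t = a + bt$$

The least squares estimates are

$$b = \frac{\sum_t (t - \bar{t})(X_t - \bar{X})}{\sum_t (t - \bar{t})^2}, \qquad a = \bar{X} - b\,\bar{t},$$

where the sums run over the nonmissing observations and $\bar{t}$ and $\bar{X}$ are the corresponding means.

2. Apply the regression equation to replace the missing values

$$\hat{X}_{i+l} = a + b(i+l)$$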
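
The two steps can be sketched in Python as follows, assuming `None` marks a missing value and time is indexed from 1. The function name `trend_fill` is hypothetical, and the sketch uses ordinary least squares of the series on the time index.

```python
def trend_fill(series):
    """Replace None entries with values from a least squares line X_t = a + b*t."""
    # Step 1: fit the line using only the nonmissing observations (t starts at 1).
    points = [(t, x) for t, x in enumerate(series, start=1) if x is not None]
    m = len(points)
    t_bar = sum(t for t, _ in points) / m
    x_bar = sum(x for _, x in points) / m
    s_tt = sum((t - t_bar) ** 2 for t, _ in points)
    s_tx = sum((t - t_bar) * (x - x_bar) for t, x in points)
    b = s_tx / s_tt
    a = x_bar - b * t_bar
    # Step 2: apply the fitted line at the positions of the missing values.
    return [x if x is not None else a + b * t
            for t, x in enumerate(series, start=1)]


filled_trend = trend_fill([1.0, None, 3.0, 4.0])
```

Unlike linear interpolation, this method can also fill gaps at the start or end of the series, because the fitted line is defined for every time point.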
ROC produces a receiver operating characteristic (ROC) curve.


Notation and Definitions 


Table 90-1
Notation

Notation        Description
$d_i$           Actual state for case i; it is either positive or negative. Positive usually means that a test
                detected some evidence for a condition to exist.
$x_i$           Test result score for case i.
TP              Number of true positive decisions
FN              Number of false negative decisions
TN              Number of true negative decisions
FP              Number of false positive decisions
Sensitivity     Probability of correctly identifying a positive
Specificity     Probability of correctly identifying a negative
$C$             Cutoff or criterion value on the test result variable
$n_-$           Number of cases with negative actual state
$n_+$           Number of cases with positive actual state
$n_{-=j}$       Number of true negative cases with test result equal to j.
$n_{+>j}$       Number of true positive cases with test result greater than j.
$n_{+=j}$       Number of true positive cases with test result equal to j.
$n_{-<j}$       Number of true negative cases with test result less than j.
$Q_1$           The probability that two randomly chosen positive state subjects will both get a more
                positive test result than a randomly chosen negative state subject.
$Q_2$           The probability that one randomly chosen positive state subject will get a more positive
                test result than two randomly chosen negative state subjects.


Construction of the ROC Curve 


The ROC plot is merely the graph of points defined by sensitivity and (1 - specificity).
Customarily, sensitivity takes the y axis and (1 - specificity) takes the x axis.


Computation of Sensitivity and Specificity 


The ROC procedure fixes the set of cutoffs to be the set defined by the values half the distance
between each successive pair of observed test scores, plus $\max(x_i) + 1$ and $\min(x_i) - 1$.


Given a set of cutoffs, the actual state values, and test result values, one can classify each
observation into one of TP, FN, TN, and FP according to a classification rule. Then, the
computation of sensitivity and specificity is immediate from their definitions.

Four classification or decision rules are possible:


Table 90-2
Classification of decision rules


(1) A test result is positive if the test result value is greater than or equal to C, and
negative if the test result value is less than C.


(2) A test result is positive if the test result value is greater than C, and negative if
the test result value is less than or equal to C.


(3) A test result is positive if the test result value is less than or equal to C, and
negative if the test result value is greater than C.


(4) A test result is positive if the test result value is less than C, and negative if
the test result value is greater than or equal to C.


Specificity
Specificity is defined by

$$\frac{TN}{TN + FP}$$


Sensitivity
Sensitivity is defined by

$$\frac{TP}{TP + FN}$$
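
To make the cutoff construction and the two definitions concrete, here is a small Python sketch using classification rule (1), where a result is positive if the score is greater than or equal to the cutoff. The function names and the six-case data set are hypothetical, and the sketch assumes both actual states occur in the data.

```python
def cutoffs(scores):
    """Midpoints between successive distinct scores, plus min - 1 and max + 1."""
    values = sorted(set(scores))
    mids = [(lo + hi) / 2 for lo, hi in zip(values, values[1:])]
    return [values[0] - 1] + mids + [values[-1] + 1]


def roc_points(states, scores):
    """Return (cutoff, sensitivity, 1 - specificity) triples under rule (1).

    states[i] is True when the actual state of case i is positive.
    """
    points = []
    for c in cutoffs(scores):
        tp = sum(1 for d, x in zip(states, scores) if d and x >= c)
        fn = sum(1 for d, x in zip(states, scores) if d and x < c)
        tn = sum(1 for d, x in zip(states, scores) if not d and x < c)
        fp = sum(1 for d, x in zip(states, scores) if not d and x >= c)
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        points.append((c, sensitivity, 1 - specificity))
    return points


states = [True, True, True, False, False, False]
scores = [3, 4, 5, 1, 2, 3]
points = roc_points(states, scores)
```

The lowest cutoff classifies every case as positive and the highest classifies every case as negative, so the points run from the upper right corner of the plot to the lower left corner.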


Interpolation of the Points 


When the test result variable is a scale variable, the number of distinct test result values and
thus the number of cutoff points tend to increase as the number of observations (or test results)
increases. Theoretically, in the “limit” the pairs of sensitivity and (1 - specificity) values form
a dense set of points in itself and in some continuous curve, the ROC curve. A continuous
interpolation of the points may be reasonable in this sense.


Note: The domain of the test result variable need only be a positive-measure subset of the real
line. For example, it could be defined only on $(-1, 0]$ and $(1, +\infty)$. As long as the variable is not
discrete, the ROC curve will be continuous.


When the test result variable is an ordinal discrete variable, the points never become dense,
even when there is a countably infinite number of (ordinal discrete) values. Thus, a continuous
interpolation may not be justifiable. However, when it is reasonable to assume that there is some underlying
or latent continuous variable, an interpolation such as a linear interpolation, though imprecise,
may be attempted. From now on, the test result variable is assumed continuous or practically so.


The problem is related to having ties, but it is not the same. In the continuous case, when values are
tied, they are identical but unique. In the ordinal case with the grouped/discretized continuous
interpretation, values in some underlying continuous scale range may be grouped together and
represented by a certain value, usually the mid-range value. Those values are represented as if they
were ties, but in fact they are a collection of unordered values. Now, even if each category/group
contains only one observation, the problem still exists unless the observation’s latent value is
identical to the representing value of the observation.


Case 1: No ties between actual positive and actual negative groups 


If there are ties within a group, the vertical/horizontal distance between the points is simply 
multiplied by the number of ties. If not, all the points are uniformly spaced within each of the 
vertical and horizontal directions, because as a cutoff value changes, only one observation at a 
time switches the test result. 


Case 2: Some ties between actual positive and actual negative groups 


For ties between actual positive and actual negative groups, both TP and FP change
simultaneously, and we do not know “the correct path between two adjacent points” (Zweig
and Campbell, 1993, p. 566). “It could be the minimal path (horizontal first, then vertical) or
the maximal path (vice versa). The straight diagonal line segment is the average of the two most
extreme paths and tends to underestimate the plot for diagnostically accurate test” (Zweig and
Campbell, 1993, p. 566). Nevertheless, the diagonal segment is our choice here. In passing, the distance and angle of this
diagonal line depend on the numbers of ties within the D+ and D- groups.


The Area Under the ROC Curve 


Let $x$ represent the scale of the test result variable, with its low values suggesting a negative result
and the high values a positive result. Denote by $x_+$ the $x$ values for cases with positive actual
states. Similarly, denote by $x_-$ the $x$ values for cases with negative actual states. Then, the “true”
area under the ROC curve is

$$\theta = \Pr(x_+ > x_-).$$

The nonparametric approximation of $\theta$ is

$$W = \frac{1}{n_+ n_-} \sum_{\text{all possible combinations of } (x_+,\,x_-)} s(x_+, x_-),$$

where $n_+$ is the sample size of D+, $n_-$ is the sample size of D-, and

$$s(x_+, x_-) = \begin{cases} 1 & \text{if } x_+ > x_- \\ \tfrac{1}{2} & \text{if } x_+ = x_- \\ 0 & \text{if } x_+ < x_- \end{cases}$$

Note that $W$ is the observed area under the ROC curve, which connects successive points by a
straight line, i.e., by the trapezoidal rule.

An alternative way to compute $W$ is as follows:

$$W = \frac{1}{n_+ n_-} \sum_{x \,\in\, \{\text{set of all test result values}\}} \left\{ n_{-=x} \times n_{+>x} + \frac{n_{-=x} \times n_{+=x}}{2} \right\}$$


When a low value of x suggests a positive test result and a high value a negative test result 


If a low value of $x$ suggests a positive test result and a high value a negative test result, compute $W$ as
above and then

$$W' = 1 - W,$$

where $W'$ is the estimated area under the curve when a low test result score suggests a positive
test result.


SE of the area under the ROC curve statistic, nonparametric assumption 


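The standard deviation of $W$ is estimated by:

$$SE(W) = \sqrt{\frac{W(1-W) + (n_+ - 1)(Q_1 - W^2) + (n_- - 1)(Q_2 - W^2)}{n_+ n_-}}$$

where

$$Q_1 = \frac{1}{n_- n_+^2} \sum_j n_{-=j} \times \left[ n_{+>j}^2 + n_{+>j} \times n_{+=j} + \frac{n_{+=j}^2}{3} \right]$$

and

$$Q_2 = \frac{1}{n_-^2 n_+} \sum_j n_{+=j} \times \left[ n_{-<j}^2 + n_{-<j} \times n_{-=j} + \frac{n_{-=j}^2}{3} \right]$$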
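
The following Python sketch computes $W$ both ways (by pairwise comparison and by counts per test result value) together with $Q_1$, $Q_2$, and the nonparametric standard error. Function names and the example data are hypothetical, and high scores are assumed to indicate a positive result.

```python
import math
from collections import Counter


def auc_pairs(pos, neg):
    """W as the average of s(x+, x-) over all (positive, negative) pairs."""
    total = 0.0
    for xp in pos:
        for xn in neg:
            total += 1.0 if xp > xn else 0.5 if xp == xn else 0.0
    return total / (len(pos) * len(neg))


def auc_and_se(pos, neg):
    """W, Q1, Q2 and SE(W) from counts at each distinct test result value."""
    n_pos, n_neg = len(pos), len(neg)
    pos_eq, neg_eq = Counter(pos), Counter(neg)
    w_sum = q1_sum = q2_sum = 0.0
    for j in set(pos) | set(neg):
        p_eq, n_eq = pos_eq[j], neg_eq[j]
        p_gt = sum(1 for x in pos if x > j)  # positives with result greater than j
        n_lt = sum(1 for x in neg if x < j)  # negatives with result less than j
        w_sum += n_eq * p_gt + n_eq * p_eq / 2
        q1_sum += n_eq * (p_gt ** 2 + p_gt * p_eq + p_eq ** 2 / 3)
        q2_sum += p_eq * (n_lt ** 2 + n_lt * n_eq + n_eq ** 2 / 3)
    w = w_sum / (n_pos * n_neg)
    q1 = q1_sum / (n_neg * n_pos ** 2)
    q2 = q2_sum / (n_neg ** 2 * n_pos)
    variance = (w * (1 - w) + (n_pos - 1) * (q1 - w * w)
                + (n_neg - 1) * (q2 - w * w)) / (n_pos * n_neg)
    return w, q1, q2, math.sqrt(variance)


pos_scores = [3, 4, 5]
neg_scores = [1, 2, 3]
w, q1, q2, se = auc_and_se(pos_scores, neg_scores)
```

For the reversed orientation, where low scores indicate a positive result, the same quantities give $W' = 1 - W$ with an unchanged standard error.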


When a low value of x suggests a positive test result and a high value a negative test result 


If we assume that a low value of $x$ suggests a positive test result and a high value a negative test
result, then we estimate the standard deviation of $W'$ by $SE(W') = SE(W)$.


Under the bi-negative exponential distribution assumption, given that the number of negative results
equals the number of positive results


Under the bi-negative exponential distribution assumption when $n_+ = n_-$,

$$Q_1 = \frac{W}{2 - W}, \qquad Q_2 = \frac{2W^2}{1 + W}.$$

$SE(W)$ is then computed as before.


When a low value of x suggests a positive test result and a high value a negative test result 


Once again, SE(W’) = SE(W). 


The asymptotic confidence interval of the area under the ROC curve 


A 2-sided asymptotic $c\% = (100 - \alpha)\%$ confidence interval for the true area under the ROC
curve is

$$W \pm Z_{\alpha/200}\,SE(W).$$


When a low value of x suggests a positive test result and a high value a negative test result

$$W' \pm Z_{\alpha/200}\,SE(W')$$


Asymptotic P-value 


Since $W$ is asymptotically normal under the null hypothesis that $\theta = 0.5$, we can calculate the
asymptotic P-value under the null hypothesis that $\theta = 0.5$ vs. the alternative hypothesis that
$\theta \ne 0.5$:

$$\Pr\left( |Z| > \frac{|W - 0.5|}{SD(W)|_{\theta=0.5}} \right) = 2\Pr\left( Z > \frac{|W - 0.5|}{SD(W)|_{\theta=0.5}} \right)$$

In the nonparametric case,

$$SD(W)|_{\theta=0.5} = \sqrt{\frac{\theta(1-\theta) + (n_+ - 1)(Q_1 - \theta^2) + (n_- - 1)(Q_2 - \theta^2)}{n_+ n_-}}\;\Bigg|_{\theta=0.5}$$

$$= \sqrt{\frac{0.5(1-0.5) + (n_+ - 1)(1/3 - 0.5^2) + (n_- - 1)(1/3 - 0.5^2)}{n_+ n_-}}$$

$$= \sqrt{\frac{n_+ + n_- + 1}{12\,n_+ n_-}} = \frac{1}{n_+ n_-}\sqrt{\frac{n_+ n_- (n_+ + n_- + 1)}{12}},$$

because we can deduce that $Q_1 = 1/3$ and $Q_2 = 1/3$ under the null hypothesis that $\theta = 0.5$. The
argument for $Q_1 = 1/3$ is as follows. $\theta = 0.5$ implies that the distribution of test results of positive
actual state subjects is identical to the distribution of test results of negative actual state subjects.
So, the mixture of the two distributions is identical to either one of the distributions. Then, we can
reinterpret $Q_1$ as the probability that, given three randomly chosen subjects from the (mixture)
distribution, the subject with the lowest test result was selected, say, first. (One may consider this
subject as a negative state subject and the other two as positive state subjects.) From here on,
we can pursue a purely combinatorial argument, irrespective of the distribution of subjects’ test
results, because the drawings are independent and given. There are $3! = 3 \times 2 \times 1 = 6$ ways to
order the three subjects, and there are two ways in which the subject with the lowest test result
comes first. So, if $\theta = 0.5$, $Q_1 = 2/6 = 1/3$. The argument for $Q_2 = 1/3$ is similar.
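
A short Python sketch of the asymptotic P-value in the nonparametric case. The function names are hypothetical. It uses the identity that, for a standard normal $Z$, $2\Pr(Z > z)$ equals the complementary error function evaluated at $z/\sqrt{2}$, and it also evaluates the long form of the null standard deviation so that the simplification to $(n_+ + n_- + 1)/(12\,n_+ n_-)$ can be checked numerically.

```python
import math


def null_sd(n_pos, n_neg):
    """Closed form of SD(W) under theta = 0.5 in the nonparametric case."""
    return math.sqrt((n_pos + n_neg + 1) / (12 * n_pos * n_neg))


def null_sd_long_form(n_pos, n_neg, theta=0.5, q1=1 / 3, q2=1 / 3):
    """The same quantity from the general variance expression with Q1 = Q2 = 1/3."""
    variance = (theta * (1 - theta)
                + (n_pos - 1) * (q1 - theta ** 2)
                + (n_neg - 1) * (q2 - theta ** 2)) / (n_pos * n_neg)
    return math.sqrt(variance)


def asymptotic_p_value(w, n_pos, n_neg):
    """Two-sided P-value for H0: theta = 0.5."""
    z = abs(w - 0.5) / null_sd(n_pos, n_neg)
    # 2 * Pr(Z > z) for standard normal Z equals erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2))


p_value = asymptotic_p_value(8.5 / 9, 3, 3)
```

With samples as small as those in this example, the normal approximation is rough; the sketch is meant only to show the arithmetic.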


ROC Algorithms 


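In the bi-negative exponential case,

$$SD(W)|_{\theta=0.5} = \sqrt{\frac{\theta(1-\theta) + (n_+ - 1)(Q_1 - \theta^2) + (n_- - 1)(Q_2 - \theta^2)}{n_+ n_-}}\;\Bigg|_{\theta=0.5}$$

$$= \sqrt{\frac{0.5(1-0.5) + (n_+ - 1)\left(\dfrac{0.5}{2-0.5} - 0.5^2\right) + (n_- - 1)\left(\dfrac{2 \cdot 0.5^2}{1+0.5} - 0.5^2\right)}{n_+ n_-}}$$

$$= \sqrt{\frac{0.5(1-0.5) + (n_+ - 1)(1/3 - 0.5^2) + (n_- - 1)(1/3 - 0.5^2)}{n_+ n_-}},$$

where $n_+ = n_-$. (Note that this formula is identical to the nonparametric one except for the
sample size restriction.)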


When a low value of x suggests a positive test result and a high value a negative test result 


The asymptotic P-value under the null hypothesis that $\theta' = 0.5$ vs. the alternative hypothesis that
$\theta' \ne 0.5$, if desired, may be computed using $W'$ and $SD(W')|_{\theta'=0.5} = SD(W)|_{\theta=0.5}$.


SAMPLE Algorithms


SAMPLE permanently draws a random sample of cases for processing in all subsequent 
procedures. 


Selection of a Proportion p 


For each case, a random uniform number in the range 0 to 1 is generated. If it is less than p, 
the case is included in the sample. 


Selection of a Sample 


Select a case if its uniform (0,1) number is less than p. If selected, n; = n; — 1, and return to (a). 


Selection of Cases in Nonparametric Procedures 
The sampling procedure is as follows: 


Each time a case is encountered after the limit imposed by the size of the workspace has been 
reached, the program decides whether to include it in the sample or not at random. The probability 
that the new cases will enter the sample is equal to the number of cases that can be held in the 
workspace divided by the number of cases so far encountered. 


If the program decides to accept a case, it then picks at random one of the cases previously stored 
in the workspace and drops it from the analysis, replacing it with the new case. Each case has 
the same probability of being in the sample. 
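
The replacement scheme described above matches what is commonly called reservoir sampling. The Python sketch below illustrates the idea under the assumption that cases arrive one at a time; the function name, the `capacity` parameter, and the seeded generator are hypothetical, and nothing here is taken from the procedures' actual code.

```python
import random


def workspace_sample(cases, capacity, rng=None):
    """Keep at most `capacity` cases so that each case is equally likely to be retained."""
    rng = rng or random.Random()
    workspace = []
    for seen, case in enumerate(cases, start=1):
        if seen <= capacity:
            # The workspace is not yet full: store the case.
            workspace.append(case)
        elif rng.random() < capacity / seen:
            # Accept with probability capacity / (cases encountered so far),
            # then drop a randomly chosen stored case in its favor.
            workspace[rng.randrange(capacity)] = case
    return workspace


sample = workspace_sample(range(1000), capacity=10, rng=random.Random(0))
small = workspace_sample(range(5), capacity=10)
```

When fewer cases than the capacity are encountered, every case is kept and no random decisions are made.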


If case weighting is used, the nonparametric procedures can use a case more than once. For 
example, if the weight of a case is 2.3, the program will use that case twice, and may choose at 
random, with a probability of 0.3, to use it a third time. If sampling is in effect, each of these 
two or three cases is a candidate for sampling. 


SEASON Algorithms 


Model 


Based on the multiplicative or additive model, the SEASON procedure decomposes the existing 
series into three components: trend-cycle, seasonal, and irregular. 


Multiplicative Model 


$$X_t = TC_t \cdot S_t \cdot I_t, \quad t = 1, \ldots, n$$


Additive Model

$$X_t = TC_t + S_t + I_t, \quad t = 1, \ldots, n$$


where $TC_t$ is the “trend-cycle” component, $S_t$ is the “seasonal” component, and $I_t$ is the
“irregular” or “random” component.


The procedure for estimating the seasonal component is: 


(1) Smooth the series by the moving average method; the moving average series reflects the trend-cycle 
component. 


(2) Obtain the seasonal-irregular component by dividing the original series by the smoothed values 
if the model is multiplicative, or by subtracting the smoothed values from the original series if 
the model is additive. 


(3) Isolate the seasonal component from the seasonal-irregular component by computing the medial 
average (average) of the specific seasonal relatives for each unit of periods if the model is 
multiplicative (additive). 


Moving Average Series 


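Based on the specified method and period $p$, the moving average series $Z_t$ for $X_t$ is defined
as follows:


p is even, weight all points equally

$$Z_t = \begin{cases} \displaystyle\sum_{j=t-p/2}^{t+p/2-1} X_j \Big/ p, & t = \tfrac{p}{2}+1, \ldots, n-\tfrac{p}{2}+1 \\[1ex] \text{SYSMIS} & \text{otherwise} \end{cases}$$


p is even, weights unequal

$$Z_t = \begin{cases} \dfrac{X_{t-p/2} + X_{t+p/2}}{2p} + \displaystyle\sum_{j=t-p/2+1}^{t+p/2-1} X_j \Big/ p, & t = \tfrac{p}{2}+1, \ldots, n-\tfrac{p}{2} \\[1ex] \text{SYSMIS} & \text{otherwise} \end{cases}$$


p is odd

$$Z_t = \begin{cases} \displaystyle\sum_{j=t-[p/2]}^{t+[p/2]} X_j \Big/ p, & t = [\tfrac{p}{2}]+1, \ldots, n-[\tfrac{p}{2}] \\[1ex] \text{SYSMIS} & \text{otherwise} \end{cases}$$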
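
The three cases can be sketched in Python as follows. The function name and the `equal_weights` flag are hypothetical, `None` stands in for SYSMIS, and the loop keeps the 1-based time index of the text while the list itself is 0-based.

```python
def moving_average(x, p, equal_weights=True):
    """Moving average series Z_1..Z_n for period p; None marks SYSMIS."""
    n = len(x)
    z = [None] * n
    h = p // 2  # [p/2]
    for t in range(1, n + 1):
        if p % 2 == 1:
            lo, hi = t - h, t + h
            if lo >= 1 and hi <= n:
                z[t - 1] = sum(x[lo - 1:hi]) / p
        elif equal_weights:
            lo, hi = t - h, t + h - 1
            if lo >= 1 and hi <= n:
                z[t - 1] = sum(x[lo - 1:hi]) / p
        else:
            lo, hi = t - h, t + h
            if lo >= 1 and hi <= n:
                inner = sum(x[lo:hi - 1])  # X_{lo+1} .. X_{hi-1}
                z[t - 1] = (x[lo - 1] + x[hi - 1]) / (2 * p) + inner / p
    return z


series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
z_equal = moving_average(series, 4, equal_weights=True)
z_unequal = moving_average(series, 4, equal_weights=False)
z_odd = moving_average(series, 3)
```

With unequal weights the average is centered on $t$, so for a linear series such as this one it reproduces the original value at $t$, whereas the equally weighted even-period average is offset by half a period.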


Ratios or Differences (Seasonal-Irregular Component) 


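Multiplicative Model

$$SI_t = \begin{cases} \text{SYSMIS}, & \text{if } Z_t = \text{SYSMIS} \\ (X_t / Z_t) \times 100, & \text{otherwise} \end{cases}$$


Additive Model

$$SI_t = \begin{cases} \text{SYSMIS}, & \text{if } Z_t = \text{SYSMIS} \\ X_t - Z_t, & \text{otherwise} \end{cases}$$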


Seasonal Factors (Seasonal Components) 


Multiplicative Model 


medial average(SIi4p, Slt+2p, ---,St+qp); 1<t<L- [Ep 
Fy medial average(SIi+p, Sle+2p, --+sSIt+(q—1)p), L- [£|p <t< [8 
medial average(SI;, SIi+p, ---,STis(q_1)p); [5] <t<p 
where 
L=n-4+1, q= 4) , if p is even and all points are weighted equally 
L=n- [8],  q=[(n—p/2)/p], otherwise 


and the medial average of a series is the mean value of the series after the smallest and the largest 
values are excluded. The seasonal factor is defined as 


SAR =F pe, t=1,...,p 


dF 
t=1 


Additive Model 


$F_t$ is defined as the arithmetic average of the series shown above. Then

$$SAF_t = F_t - \bar{F}$$


Seasonally Adjusted Series (SAS)

$$SAS_t = \begin{cases} (X_t / SAF_m) \cdot 100, & \text{if model is multiplicative} \\ X_t - SAF_m, & \text{if model is additive} \end{cases}$$

where

$$m = t - [t/p]\,p$$


Smoothed Trend-Cycle Series 


The smoothed trend-cycle series (STC) is obtained by applying a 3 x 3 moving average to the
seasonally adjusted series (SAS). Thus,

$$STC_t = \frac{1}{9}\left[ SAS_{t-2} + 2\,SAS_{t-1} + 3\,SAS_t + 2\,SAS_{t+1} + SAS_{t+2} \right], \quad t = 3, \ldots, n-2$$

and for the two end points at the beginning and end of the series

$$STC_2 = \tfrac{1}{3}\left[ SAS_1 + SAS_2 + SAS_3 \right]$$
$$STC_{n-1} = \tfrac{1}{3}\left[ SAS_{n-2} + SAS_{n-1} + SAS_n \right]$$

$$STC_1 = STC_2 + \tfrac{1}{2}\left[ STC_2 - STC_3 \right]$$
$$STC_n = STC_{n-1} + \tfrac{1}{2}\left[ STC_{n-1} - STC_{n-2} \right]$$
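
A compact Python sketch of the smoothing step, assuming the seasonally adjusted series contains no missing values and has at least five points. The function name is hypothetical and list positions are 0-based, so position `t - 1` holds the value for time $t$.

```python
def smoothed_trend_cycle(sas):
    """Apply the 3 x 3 moving average with the end-point rules given above."""
    n = len(sas)
    if n < 5:
        raise ValueError("need at least five observations")
    stc = [0.0] * n
    # Interior points t = 3, ..., n-2 (0-based positions 2 .. n-3).
    for i in range(2, n - 2):
        stc[i] = (sas[i - 2] + 2 * sas[i - 1] + 3 * sas[i]
                  + 2 * sas[i + 1] + sas[i + 2]) / 9
    # Second and second-to-last points: simple 3-term averages.
    stc[1] = (sas[0] + sas[1] + sas[2]) / 3
    stc[n - 2] = (sas[n - 3] + sas[n - 2] + sas[n - 1]) / 3
    # First and last points: extrapolate from the neighbouring smoothed values.
    stc[0] = stc[1] + (stc[1] - stc[2]) / 2
    stc[n - 1] = stc[n - 2] + (stc[n - 2] - stc[n - 3]) / 2
    return stc


stc = smoothed_trend_cycle([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
```

The weights 1, 2, 3, 2, 1 arise from averaging three consecutive 3-term averages, which is why the divisor is 9.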


Irregular Component 


For $t = 1, \ldots, n$

$$I_t = \begin{cases} SAS_t / STC_t, & \text{if model is multiplicative} \\ SAS_t - STC_t, & \text{if model is additive} \end{cases}$$


Data mining problems often involve hundreds, or even thousands, of variables. As a result,
the majority of time and effort spent in the model-building process involves examining which 
variables to include in the model. Fitting a computationally intensive model to a set of variables 
this large may require more time than is practical. 

Predictor selection allows the variable set to be reduced in size, creating a more manageable set 
of attributes for modeling. Adding predictor selection to the analytical process has several benefits: 


■ Simplifies and narrows the scope of the variables essential to building a predictive model.

■ Minimizes the computational time and memory requirements for building a predictive model
because focus can be directed to a subset of predictors.

■ Leads to more accurate and/or more parsimonious models.

■ Reduces the time for generating scores because the predictive model is based upon only a
subset of predictors.


Screening 


This step removes variables and cases that do not provide useful information for prediction and 
issues warnings about variables that may not be useful. 


The following variables are removed:
■ Variables that have all missing values.
■ Variables that have all constant values.
■ Variables that represent case ID.


The following cases are removed:
■ Cases that have missing target values.
■ Cases that have missing values in all of their predictors.


The following variables are removed based on user settings:
■ Variables that have more than m1% missing values.
■ Categorical variables that have a single category accounting for more than m2% of the cases.
■ Continuous variables that have standard deviation < m3%.
■ Continuous variables that have a coefficient of variation |CV| < m4%. CV = standard
deviation / mean.
■ Categorical variables that have a number of categories greater than m5% of the cases.


Values m1, m2, m3, m4, and m5 are user-controlled parameters.
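
As an illustration of how such rules might be applied to one categorical predictor, consider the Python sketch below. It is an assumption-laden example, not the product's logic: the function name, the returned reason strings, the use of `None` for missing values, and the choice of denominators (all cases for the missing percentage, nonmissing cases for the category checks) are all hypothetical.

```python
from collections import Counter


def screen_categorical(values, m1, m2, m5):
    """Return a reason for removing the predictor, or None to keep it.

    m1, m2 and m5 are percentages, as in the user-controlled settings above.
    """
    n = len(values)
    present = [v for v in values if v is not None]
    if not present:
        return "all values missing"
    if 100.0 * (n - len(present)) / n > m1:
        return "too many missing values"
    counts = Counter(present)
    if len(counts) == 1:
        return "constant"
    if 100.0 * max(counts.values()) / len(present) > m2:
        return "single dominant category"
    if len(counts) > m5 / 100.0 * len(present):
        return "too many categories"
    return None


dominant = screen_categorical(["a"] * 95 + ["b"] * 5, m1=70, m2=90, m5=95)
balanced = screen_categorical(["a", "b"] * 50, m1=70, m2=90, m5=95)
sparse = screen_categorical([None] * 80 + ["a", "b"] * 10, m1=70, m2=90, m5=95)
id_like = screen_categorical([str(i) for i in range(100)], m1=70, m2=90, m5=95)
```

The order of the checks determines which reason is reported when several rules apply; the order used here is arbitrary.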


Ranking Predictors 


This step considers one predictor at a time to see how well each predictor alone predicts the target 
variable. The predictors are ranked according to a user-specified criterion. Available criteria 
depend on the measurement levels of the target and predictor.


Categorical Target


This section describes ranking of predictors for a categorical target under the following scenarios:
■ All predictors categorical
■ All predictors continuous
■ Some predictors categorical, some continuous

All Categorical Predictors 


The following notation applies: 


Table 93-1
Notation

Notation        Description
$X$             The predictor under consideration with $I$ categories.
$Y$             Target variable with $J$ categories.
$N$             Total number of cases.
$N_{ij}$        The number of cases with $X = i$ and $Y = j$.
$N_{i\cdot}$    The number of cases with $X = i$. $N_{i\cdot} = \sum_{j=1}^{J} N_{ij}$
$N_{\cdot j}$   The number of cases with $Y = j$. $N_{\cdot j} = \sum_{i=1}^{I} N_{ij}$


The above notations are based on nonmissing pairs of $(X, Y)$. Hence $J$, $N$, and $N_{\cdot j}$ may be
different for different predictors.


P Value Based on Pearson’s Chi-square 


Pearson’s chi-square is a test of independence between X and Y that involves the difference
between the observed and expected frequencies. The expected cell frequencies under the null
hypothesis of independence are estimated by $\hat{N}_{ij} = N_{i\cdot} N_{\cdot j} / N$. Under the null
hypothesis, Pearson’s chi-square converges asymptotically to a chi-square distribution $\chi^2_d$ with
degrees of freedom $d = (I-1)(J-1)$.


The p value based on Pearson’s chi-square $X^2$ is calculated by p value = Prob($\chi^2_d > X^2$), where

$$X^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \left(N_{ij} - \hat{N}_{ij}\right)^2 / \hat{N}_{ij}.$$


Predictors are ranked by the following rules:
1. Sort the predictors by p value in ascending order.
2. If ties occur, sort by chi-square in descending order.
3. If ties still occur, sort by degrees of freedom d in ascending order.
4. If ties still occur, sort by the data file order.


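The four tie-breaking rules above amount to a single multi-key sort. The following is a minimal sketch, assuming each predictor's results have already been computed; the `ChiSquareResult` record and the `rank_by_chi_square` function are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple


class ChiSquareResult(NamedTuple):
    """Hypothetical container for one predictor's test results."""
    name: str
    p_value: float
    chi_square: float
    df: int
    file_order: int  # position of the predictor in the data file


def rank_by_chi_square(results):
    # Ascending p value, then descending chi-square (hence the minus sign),
    # then ascending degrees of freedom, then data file order.
    return sorted(
        results,
        key=lambda r: (r.p_value, -r.chi_square, r.df, r.file_order),
    )
```

Because the key is a tuple, each later component is consulted only when all earlier components tie, which mirrors the wording of the rules.

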
P Value Based on Likelihood Ratio Chi-square 


The likelihood ratio chi-square is a test of independence between X and Y that involves the ratio
between the observed and expected frequencies. The expected cell frequencies under the null
hypothesis of independence are estimated by $\hat{N}_{ij} = N_{i\cdot} N_{\cdot j} / N$. Under the null hypothesis, the
likelihood ratio chi-square converges asymptotically to a chi-square distribution $\chi^2_d$ with degrees
of freedom $d = (I-1)(J-1)$.


The p value based on likelihood ratio chi-square $G^2$ is calculated by p value = Prob($\chi^2_d > G^2$), where

$$G^2 = 2 \sum_{i=1}^{I} \sum_{j=1}^{J} F_{ij}, \quad \text{with } F_{ij} = \begin{cases} N_{ij} \ln\left(N_{ij} / \hat{N}_{ij}\right), & N_{ij} > 0, \\ 0, & \text{else.} \end{cases}$$


Predictors are ranked according to the same rules as those for the p value based on Pearson’s 
chi-square. 
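
Both chi-square statistics can be computed in one pass over the contingency table. The sketch below assumes the table is supplied as a list of rows of counts with no empty row or column (so every expected frequency is positive); the function name `chi_square_statistics` is an illustrative choice, and the p value itself is left out because it requires a chi-square distribution function.

```python
import math


def chi_square_statistics(table):
    """Return (X2, G2, d) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    x2 = 0.0
    g2 = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            x2 += (n_ij - expected) ** 2 / expected
            if n_ij > 0:  # empty cells contribute 0 to G2
                g2 += 2.0 * n_ij * math.log(n_ij / expected)
    d = (len(row_totals) - 1) * (len(col_totals) - 1)
    return x2, g2, d
```
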


Cramer’s V 


Cramer’s V is a measure of association, between 0 and 1, based upon Pearson’s chi-square. It is
defined as

$$V = \left( \frac{X^2}{N \left(\min\{I, J\} - 1\right)} \right)^{1/2}.$$


Predictors are ranked by the following rules:
1. Sort predictors by Cramer’s V in descending order.
2. If ties occur, sort by chi-square in descending order.
3. If ties still occur, sort by data file order.


Lambda 


Lambda is a measure of association that reflects the proportional reduction in error when values of 
the independent variable are used to predict values of the dependent variable. A value of 1 means 
that the independent variable perfectly predicts the dependent variable. A value of 0 means that 
the independent variable is no help in predicting the dependent variable. It is computed as 


$$\lambda(Y \mid X) = \frac{\sum_{i=1}^{I} \max_j \left(N_{ij}\right) - \max_j \left(N_{\cdot j}\right)}{N - \max_j \left(N_{\cdot j}\right)}.$$


Predictors are ranked by the following rules:
1. Sort predictors by lambda in descending order.
2. If ties occur, sort by I in ascending order.
3. If ties still occur, sort by data file order.
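

The two association measures can be illustrated on a small table. This is a sketch under the assumption that the table has at least two rows and two columns, no empty row or column, and that the largest column total is smaller than N; `cramers_v_and_lambda` is a hypothetical helper name.

```python
import math


def cramers_v_and_lambda(table):
    """Return (V, lambda) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Pearson's chi-square, needed for Cramer's V.
    x2 = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            x2 += (n_ij - expected) ** 2 / expected
    v = math.sqrt(x2 / (n * (min(len(row_totals), len(col_totals)) - 1)))
    # Lambda: proportional reduction in error when predicting Y from X.
    max_col = max(col_totals)
    lam = (sum(max(row) for row in table) - max_col) / (n - max_col)
    return v, lam
```

A table in which each category of X maps to exactly one category of Y gives V = 1 and lambda = 1, while a nearly independent table gives values close to 0.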


All Continuous Predictors


If all predictors are continuous, p values based on the F statistic are used. The idea is to perform a
one-way ANOVA F test for each continuous predictor; this tests whether the mean of X is the same
across all classes of Y.


The following notation applies: 


Table 93-2
Notation

Notation          Description
$N_j$             The number of cases with $Y = j$.
$\bar{x}_j$       The sample mean of predictor X for target class $Y = j$.
$s_j^2$           The sample variance of predictor X for target class $Y = j$:
                  $s_j^2 = \sum_{i=1}^{N_j} (x_{ij} - \bar{x}_j)^2 / (N_j - 1)$.
$\bar{\bar{x}}$   The grand mean of predictor X: $\bar{\bar{x}} = \sum_{j=1}^{J} N_j \bar{x}_j / N$.


The above notations are based on nonmissing pairs of (X, Y).


P Value Based on the F Statistic 


The p value based on the F statistic is calculated by p value = Prob{$F(J-1, N-J) > F$}, where

$$F = \frac{\sum_{j=1}^{J} N_j \left(\bar{x}_j - \bar{\bar{x}}\right)^2 / (J-1)}{\sum_{j=1}^{J} (N_j - 1) s_j^2 / (N-J)}$$

and $F(J-1, N-J)$ is a random variable that follows an F distribution with degrees of freedom $J-1$
and $N-J$. If the denominator for a predictor is zero, set the p value = 0 for the predictor.


Predictors are ranked by the following rules:
1. Sort predictors by p value in ascending order.
2. If ties occur, sort by F in descending order.
3. If ties still occur, sort by N in descending order.
4. If ties still occur, sort by the data file order.
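

The F statistic above can be sketched directly from grouped data. The example assumes the predictor values have already been split into one list per target class, with at least two classes and more cases than classes; `f_statistic` is a hypothetical name, and it returns `None` for the zero-denominator case, where the text sets the p value to 0.

```python
def f_statistic(groups):
    """groups: list of lists of predictor values, one list per target class."""
    j = len(groups)
    n = sum(len(g) for g in groups)
    means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(len(g) * m for g, m in zip(groups, means)) / n
    between = sum(len(g) * (m - grand_mean) ** 2
                  for g, m in zip(groups, means)) / (j - 1)
    # (N_j - 1) * s_j^2 equals the within-class sum of squared deviations.
    within = sum(sum((x - m) ** 2 for x in g)
                 for g, m in zip(groups, means)) / (n - j)
    if within == 0:
        return None  # denominator is zero: the p value is set to 0
    return between / within
```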


Mixed Type Predictors


If some predictors are continuous and some are categorical, the criterion for continuous predictors 
is still the p value based on the F statistic, while the available criteria for categorical predictors are 
restricted to the p value based on Pearson’s chi-square or the p value based on the likelihood ratio 
chi-square. These p values are comparable and therefore can be used to rank the predictors. 


Predictors are ranked by the following rules: 
1. Sort predictors by p value in ascending order. 


2. If ties occur, follow the rules for breaking ties among all categorical and all continuous predictors 
separately, then sort these two groups (categorical predictor group and continuous predictor group) 
by the data file order of their first predictors. 


Continuous Target 


This section describes ranking of predictors for a continuous target under the following scenarios:
■ All predictors categorical
■ All predictors continuous
■ Some predictors categorical, some continuous


All Categorical Predictors 


If all predictors are categorical and the target is continuous, p values based on the F statistic are
used. The idea is to perform a one-way ANOVA F test for the continuous target using each
categorical predictor as a factor; this tests whether the mean of Y is the same across all categories
of X.


The following notation applies: 


Table 93-3
Notation

Notation          Description
$X$               The categorical predictor under consideration with $I$ categories.
$Y$               The continuous target variable. $y_{ij}$ represents the value of the continuous
                  target for the $j$th case with $X = i$.
$N_i$             The number of cases with $X = i$.
$\bar{y}_i$       The sample mean of target Y in predictor category $X = i$.
$s(y)_i^2$        The sample variance of target Y for predictor category $X = i$:
                  $s(y)_i^2 = \sum_{j=1}^{N_i} (y_{ij} - \bar{y}_i)^2 / (N_i - 1)$.
$\bar{\bar{y}}$   The grand mean of target Y: $\bar{\bar{y}} = \sum_{i=1}^{I} N_i \bar{y}_i / N$.


The above notations are based on nonmissing pairs of (X, Y).


The p value based on the F statistic is p value = Prob{$F(I-1, N-I) > F$}, where

$$F = \frac{\sum_{i=1}^{I} N_i \left(\bar{y}_i - \bar{\bar{y}}\right)^2 / (I-1)}{\sum_{i=1}^{I} (N_i - 1) s(y)_i^2 / (N-I)}$$

in which $F(I-1, N-I)$ is a random variable that follows an F distribution with degrees of freedom
$I-1$ and $N-I$. When the denominator of the above formula is zero for a given categorical predictor
X, set the p value = 0 for that predictor.


Predictors are ranked by the following rules: 
1. Sort predictors by p value in ascending order. 
2. If ties occur, sort by F in descending order. 
3. If ties still occur, sort by N in descending order. 


4. If ties still occur, sort by the data file order. 


All Continuous Predictors 


If all predictors are continuous and the target is continuous, the p value is based on the asymptotic 
t distribution of a transformation t on the Pearson correlation coefficient r. 


The following notation applies: 


Table 93-4
Notation

Notation                             Description
$X$                                  The continuous predictor under consideration.
$Y$                                  The continuous target variable.
$\bar{x} = \sum_{i=1}^{N} x_i / N$   The sample mean of predictor variable X.
$\bar{y} = \sum_{i=1}^{N} y_i / N$   The sample mean of target Y.
$s(x)^2$                             The sample variance of predictor variable X.
$s(y)^2$                             The sample variance of target variable Y.


The above notations are based on nonmissing pairs of (X, Y).


The Pearson correlation coefficient r is

$$r = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y}) / (N-1)}{\sqrt{s(x)^2 \, s(y)^2}}.$$


The transformation t on r is given by

$$t = r \sqrt{\frac{N-2}{1-r^2}}.$$


Under the null hypothesis that the population Pearson correlation coefficient $\rho = 0$, the p value
is calculated as

$$\text{p value} = \begin{cases} 0, & \text{if } r^2 = 1, \\ 2\,\text{Prob}\{T > |t|\}, & \text{else.} \end{cases}$$


T is a random variable that follows a t distribution with $N-2$ degrees of freedom. The p value
based on the Pearson correlation coefficient is a test of a linear relationship between X and Y. If
there is some nonlinear relationship between X and Y, the test may fail to catch it.


Predictors are ranked by the following rules:
1. Sort predictors by p value in ascending order.
2. If ties occur, sort by $r^2$ in descending order.
3. If ties still occur, sort by N in descending order.
4. If ties still occur, sort by the data file order.
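

As an illustration of the correlation-based criterion, the sketch below computes r and the usual t transformation $t = r\sqrt{(N-2)/(1-r^2)}$ for paired data. The function name `pearson_r_and_t` is hypothetical, and the small tolerance used to detect $r^2 = 1$ is an assumption added to guard against floating-point rounding; it returns `None` for t in that case, where the p value is defined to be 0.

```python
import math


def pearson_r_and_t(x, y):
    """Return (r, t) for paired samples; t is None when r^2 = 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    var_x = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    r = cov / math.sqrt(var_x * var_y)
    if 1.0 - r * r <= 1e-12:  # treat as r^2 = 1 (assumed tolerance)
        return r, None
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    return r, t
```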


Mixed Type Predictors 


If some predictors are continuous and some are categorical in the dataset, the criterion for 
continuous predictors is still based on the p value from a transformation and that for categorical 
predictors from the F statistic. 


Predictors are ranked by the following rules: 
1. Sort predictors by p value in ascending order. 


2. If ties occur, follow the rules for breaking ties among all categorical and all continuous predictors 
separately, then sort these two groups (categorical predictor group and continuous predictor group) 
by the data file order of their first predictors. 


Selecting Predictors 


If the length of the predictor list has not been prespecified, the following formula provides an 
automatic approach to determine the length of the list. 


Let $L_0$ be the total number of predictors under study. The length of the list L may be determined by

$$L = \left[\min\left(\max\left(30, 2\sqrt{L_0}\right), L_0\right)\right],$$

where $[x]$ is the integer closest to x. The following table illustrates the length L of the list for
different values of the total number of predictors $L_0$.


L0         L      L/L0 (%)
10         10     100.00%
15         15     100.00%
20         20     100.00%
25         25     100.00%
30         30     100.00%
40         30     75.00%
50         30     60.00%
60         30     50.00%
100        30     30.00%
500        45     9.00%
1000       63     6.30%
1500       77     5.13%
2000       89     4.45%
5000       141    2.82%
10,000     200    2.00%
20,000     283    1.42%
50,000     447    0.89%
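

The entries in the table can be reproduced with a few lines of code. This is a minimal sketch; `predictor_list_length` is a hypothetical name, and rounding to the closest integer is implemented as floor(x + 0.5).

```python
import math


def predictor_list_length(total_predictors):
    """Automatic list length L for a total of L0 = total_predictors predictors."""
    value = min(max(30.0, 2.0 * math.sqrt(total_predictors)), total_predictors)
    return int(math.floor(value + 0.5))  # closest integer
```

For small problems the formula keeps every predictor; between 30 and 225 predictors it keeps 30; beyond that the list grows with the square root of the number of predictors.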


Simulation algorithms 


Simulation in IBM® SPSS® Statistics refers to simulating input data to predictive models using 
the Monte Carlo method and evaluating the model based on the simulated data. The distribution 
of predicted target values can then be used to evaluate the likelihood of various outcomes. 


The algorithms described here are used by the SIMPLAN and SIMRUN commands. 


Simulation algorithms: create simulation plan


Creating a simulation plan includes specifying distributions for all inputs to a predictive model
that are to be simulated. When historical data are present, the distribution that most closely fits the
data for each input can be determined using the algorithms described in this section.


Notation 


The following notation is used throughout this section unless otherwise stated: 


Table 94-1
Notation

Notation            Description
$x_i$               Value of the input variable in the ith case of the historical data
$w_i$               Frequency weight associated with the ith case of the historical data
$W$                 Total effective sample size accounting for frequency weights
$\bar{x}_{obs}$     Sample mean
$s^2_{obs}$         Sample variance
$s_{obs}$           Sample standard deviation


Distribution fitting 


The historical data for a given input is denoted by:

$$x_1, x_2, \ldots, x_n$$

The total effective sample size is:

$$W = \sum_{i=1}^{n} w_i$$

The observed sample mean, sample variance and sample standard deviation are:

$$\bar{x}_{obs} = \frac{1}{W} \sum_{i=1}^{n} w_i x_i$$

$$s^2_{obs} = \frac{1}{W-1} \sum_{i=1}^{n} w_i \left(x_i - \bar{x}_{obs}\right)^2$$

$$s_{obs} = \sqrt{s^2_{obs}}$$
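

The frequency-weighted summary statistics can be sketched as follows. The function name `weighted_summary` is illustrative, and the sketch assumes the total weight W exceeds 1 so that the variance divisor is positive.

```python
import math


def weighted_summary(x, w):
    """Return (W, mean, variance, standard deviation) using frequency weights."""
    total = sum(w)
    mean = sum(wi * xi for xi, wi in zip(x, w)) / total
    variance = sum(wi * (xi - mean) ** 2 for xi, wi in zip(x, w)) / (total - 1)
    return total, mean, variance, math.sqrt(variance)
```

With integer weights this gives the same result as repeating each case $w_i$ times and applying the unweighted formulas.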


Parameter estimation for most distributions is based on the maximum likelihood (ML) method, 
and closed-form solutions for the parameters exist for many of the distributions. There is no 
closed-form ML solution for the distribution parameters for the following distributions: negative 
binomial, beta, gamma and Weibull. For these distributions, the Newton-Raphson method is used. 
This approach requires the following information: the log-likelihood function, the gradient vector, 
the Hessian matrix, and the initial values for the iterative Newton-Raphson process. 


Discrete distributions


Distribution fitting is supported for the following discrete distributions: binomial, categorical,
Poisson and negative binomial.


Binomial distribution: parameter estimation 


The probability mass function for a random variable x with a binomial distribution is:

$$\text{Bin}(x; N, P) = \binom{N}{x} P^x (1-P)^{N-x}, \quad x = 0, 1, \ldots, N$$

where $0 < P < 1$ is the probability of success. The binomial distribution is used to describe
the total number of successes in a sequence of N independent Bernoulli trials. The parameter
estimates for the binomial distribution using the method of moments (see Johnson & Kotz (2005)
for details) are:

$$\hat{P} = \begin{cases} 1 - s^2_{obs} / \bar{x}_{obs}, & \bar{x}_{obs} > s^2_{obs}, \\ \text{NaN}, & \bar{x}_{obs} \le s^2_{obs}, \end{cases}$$

where NaN implies that the binomial distribution would not be an appropriate distribution to fit
the data under this criterion, and where

$$\hat{N} = \frac{\bar{x}_{obs}}{\hat{P}}.$$

If $\hat{N}$ is not an integer, then the parameter estimates are:

$$N^* = \left[\hat{N} + 0.5\right]$$

$$P^* = \frac{\bar{x}_{obs}}{N^*}$$

where $[z]$ denotes the integer part of z.
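

A minimal sketch of the method-of-moments fit follows, assuming the estimates take the form $\hat{P} = 1 - s^2_{obs}/\bar{x}_{obs}$ and $\hat{N} = \bar{x}_{obs}/\hat{P}$ (which is what equating the binomial mean NP and variance NP(1-P) to the sample moments gives). The function name `binomial_method_of_moments` is hypothetical, and `None` stands in for the NaN case.

```python
def binomial_method_of_moments(mean, variance):
    """Return (N, P) estimates, or None when the binomial is not appropriate."""
    if mean <= variance:
        return None  # NaN case: sample variance is not smaller than the mean
    p_hat = 1.0 - variance / mean
    n_hat = mean / p_hat
    if n_hat != int(n_hat):
        # N must be an integer: take the integer part of N + 0.5, re-derive P.
        n_star = int(n_hat + 0.5)
        return n_star, mean / n_star
    return int(n_hat), p_hat
```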


Categorical distribution: parameter estimation 


The categorical distribution can be considered a special case of the multinomial distribution in
which N = 1. Suppose $x_i$, $i = 1, 2, \ldots, n$, has the categorical distribution and its categorical values
are denoted as $1, 2, \ldots, J$. Then an indicator variable of $x_i$ for category j can be denoted as

$$x_{i,j} = \begin{cases} 1, & \text{if } x_i = j, \\ 0, & \text{otherwise,} \end{cases}$$

and the corresponding probability is $P_j$. Then the probability mass function for a random variable
$x_i$ with the categorical distribution can be described based on $x_{i,j}$ and $P_j$ as follows:

$$\text{Categorical}(x_i; P_1, \ldots, P_J) = \prod_{j=1}^{J} P_j^{x_{i,j}}, \quad \text{with } \sum_{j=1}^{J} P_j = 1$$

The parameter estimates for $P_j$, $j = 1, \ldots, J$, are:

$$\hat{P}_j = \frac{\sum_{i=1}^{n} w_i x_{i,j}}{W}, \quad j = 1, \ldots, J$$


Poisson distribution: parameter estimation 


The probability mass function for a random variable x with a Poisson distribution is:

$$\text{Pois}(x; \lambda) = \frac{e^{-\lambda} \lambda^x}{x!}, \quad \text{for } x = 0, 1, 2, \ldots$$

where $\lambda > 0$ is the rate parameter of the Poisson distribution. The parameter of the Poisson
distribution can be estimated as:

$$\hat{\lambda} = \bar{x}_{obs}$$


Negative binomial distribution: parameter estimation 


The distribution fitting component for simulation supports the parameterization of the negative
binomial distribution that describes the distribution of the number of failures before the
rth success. For this parameterization, the probability mass function for a random variable x is:

$$\text{NB}(x; r, \theta) = \binom{x+r-1}{x} \theta^r (1-\theta)^x, \quad \text{for } x = 0, 1, \ldots$$

where $r > 0$, $0 < \theta < 1$ are the two distribution parameters. There is no closed-form solution
for the parameters r and $\theta$, so the Newton-Raphson method with step-halving will be used. The
method requires the following information:


(1) The log likelihood function

$$L = \sum_{i=1}^{n} w_i \ln\Gamma(x_i + r) - \sum_{i=1}^{n} w_i \ln(x_i!) - W \ln\Gamma(r) + W r \ln(\theta) + \ln(1-\theta) \sum_{i=1}^{n} w_i x_i$$


(2) The gradient (1st derivative) vector with respect to r and $\theta$

$$g = \begin{bmatrix} \partial L / \partial\theta \\ \partial L / \partial r \end{bmatrix} = \begin{bmatrix} \frac{W r}{\theta} - \frac{1}{1-\theta} \sum_{i=1}^{n} w_i x_i \\ \sum_{i=1}^{n} w_i \psi(x_i + r) - W \psi(r) + W \ln(\theta) \end{bmatrix}$$

where $\psi(a) = \Gamma'(a) / \Gamma(a)$ is a digamma function, which is the derivative of the logarithm of the gamma
function, evaluated at a.


(3) The Hessian (2nd derivative) matrix with respect to r and $\theta$ (since the Hessian matrix is
symmetric, only the lower triangular portion is displayed)

$$H = \begin{bmatrix} \frac{\partial^2 L}{\partial\theta^2} & \\ \frac{\partial^2 L}{\partial r \partial\theta} & \frac{\partial^2 L}{\partial r^2} \end{bmatrix} = \begin{bmatrix} -\frac{W r}{\theta^2} - \frac{1}{(1-\theta)^2} \sum_{i=1}^{n} w_i x_i & \\ \frac{W}{\theta} & \sum_{i=1}^{n} w_i \psi'(x_i + r) - W \psi'(r) \end{bmatrix}$$

where $\psi'(a)$ is the trigamma function, or the derivative of the digamma function.


(4) The initial values of $\theta$ and r can be obtained from the closed-form estimates using the method
of moments:

$$r^{(0)} = \begin{cases} \frac{\bar{x}_{obs}^2}{s^2_{obs} - \bar{x}_{obs}}, & \text{if } s^2_{obs} > \bar{x}_{obs}, \\ 1, & \text{otherwise,} \end{cases}$$

$$\theta^{(0)} = \frac{r^{(0)}}{r^{(0)} + \bar{x}_{obs}}$$


Note


An alternative parameterization of the negative binomial distribution describes the distribution of
the number of trials before the rth success. Although it is not supported in distribution fitting, it is
supported in simulation when explicitly specified by the user. The probability mass function for
this parameterization, for a random variable x is:

$$\text{NB}(x; r, \theta) = \binom{x-1}{r-1} \theta^r (1-\theta)^{x-r}, \quad \text{for } x \ge r$$

where $r > 0$, $0 < \theta < 1$ are the two distribution parameters.
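

The method-of-moments starting values for the failures parameterization can be sketched as below. This assumes the initial values follow from equating the mean $r(1-\theta)/\theta$ and variance $r(1-\theta)/\theta^2$ to the sample moments, with r set to 1 when the sample is not overdispersed; `negative_binomial_start` is a hypothetical name.

```python
def negative_binomial_start(mean, variance):
    """Return (r0, theta0), starting values for the Newton-Raphson iterations."""
    if variance > mean:
        r0 = mean * mean / (variance - mean)
    else:
        r0 = 1.0  # fallback when the sample variance does not exceed the mean
    theta0 = r0 / (r0 + mean)
    return r0, theta0
```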


Continuous distributions


Distribution fitting is supported for the following continuous distributions: triangular, uniform, 
normal, lognormal, exponential, beta, gamma and Weibull. 


Triangular distribution: parameter estimation 


The probability density function for a random variable x with a triangular distribution is:

$$\text{Triag}(x; a, m, b) = \begin{cases} \frac{2(x-a)}{(b-a)(m-a)}, & x \in [a, m) \\ \frac{2}{b-a}, & x = m \\ \frac{2(b-x)}{(b-a)(b-m)}, & x \in (m, b] \\ 0, & x \notin [a, b] \end{cases}$$

such that $a < m < b$. Parameter estimates of the triangular distribution are:

$$\hat{a} = \min\{x_1, x_2, \ldots, x_n\}$$
$$\hat{b} = \max\{x_1, x_2, \ldots, x_n\}$$
$$\hat{m} = \text{mode}\{x_1, x_2, \ldots, x_n\}$$

Since the calculation of the mode for continuous data may be ambiguous, we transform the
parameter estimates and use the method of moments as follows (see Kotz and Rene van Dorp
(2004) for details):

$$x' = \frac{x-a}{b-a}$$
$$\theta = \frac{m-a}{b-a}$$
$$\bar{x}' = \frac{1}{W} \sum_{i=1}^{n} w_i x_i'$$

From the method of moments we obtain

$$\hat{\theta} = 3\bar{x}' - 1$$

from which it follows that

$$\hat{m} = \hat{a} + \left(\hat{b} - \hat{a}\right) \times \left(3\bar{x}' - 1\right)$$

Note: For very skewed data or if the actual mode equals a or b, the estimated mode, $\hat{m}$, may be
less than $\hat{a}$ or greater than $\hat{b}$. In this case, the adjusted mode, defined as below, is used:

$$\text{Adj. } \hat{m} = \begin{cases} \hat{a}, & \text{if } \hat{m} < \hat{a} \\ \hat{b}, & \text{if } \hat{m} > \hat{b} \end{cases}$$
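

The triangular fit, including the adjustment of the mode, can be sketched as follows. The function name `fit_triangular` is hypothetical, and the sketch assumes the data contain at least two distinct values so that b > a.

```python
def fit_triangular(x, w):
    """Return (a, m, b) estimated from data x with frequency weights w."""
    a = min(x)
    b = max(x)
    total = sum(w)
    # Weighted mean of the data rescaled to the unit interval.
    scaled_mean = sum(wi * (xi - a) / (b - a) for xi, wi in zip(x, w)) / total
    theta = 3.0 * scaled_mean - 1.0  # method-of-moments estimate
    m = a + (b - a) * theta
    # Adjusted mode: clamp to [a, b] when the estimate falls outside.
    m = min(max(m, a), b)
    return a, m, b
```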


Uniform distribution: parameter estimation


The probability density function for a random variable x with a uniform distribution is:

$$U(x; a, b) = \begin{cases} \frac{1}{b-a}, & x \in [a, b] \\ 0, & x \notin [a, b] \end{cases}$$

where a is the minimum and b is the maximum among the values of x. Hence, the parameter
estimates of the uniform distribution are:

$$\hat{a} = \min\{x_1, x_2, \ldots, x_n\}$$
$$\hat{b} = \max\{x_1, x_2, \ldots, x_n\}$$


Normal distribution: parameter estimation 


The probability density function for a random variable x with a normal distribution is:

$$\text{Nor}(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}, \quad -\infty < x < \infty$$

Here, $\mu$ is the measure of centrality and $\sigma$ is the measure of dispersion of the normal distribution.
The parameter estimates of the normal distribution are:

$$\hat{\mu} = \bar{x}_{obs}$$
$$\hat{\sigma} = s_{obs}$$


Lognormal distribution: parameter estimation 


The lognormal distribution is a probability distribution where the natural logarithm of a random
variable follows a normal distribution. In other words, if x has a lognormal($\mu$, $\sigma$) distribution,
then ln(x) has a normal(ln($\mu$), $\sigma$) distribution. The probability density function for a random
variable x with a lognormal distribution is:

$$\text{LN}(x; \mu, \sigma) = \frac{1}{x \sigma \sqrt{2\pi}} e^{-\frac{1}{2\sigma^2}\left(\ln\frac{x}{\mu}\right)^2}, \quad 0 < x < \infty$$

Note: The form of the probability density function is the same as that used in IBM® SPSS®
Statistics.


Define $\ln\hat{\mu} = \frac{1}{W} \sum_{i=1}^{n} w_i \ln x_i$.


Parameter estimates for the lognormal distribution are:

$$\hat{\sigma} = \sqrt{\frac{1}{W} \sum_{i=1}^{n} w_i \left(\ln x_i - \ln\hat{\mu}\right)^2}$$


Exponential distribution: parameter estimation 


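The probability density function for a random variable x with an exponential distribution is:

$$\text{Exp}(x; \lambda) = \lambda e^{-\lambda x}, \quad \text{for } x > 0 \text{ and } \lambda > 0$$

The estimate of the parameter for the exponential distribution is:

$$\hat{\lambda} = \frac{1}{\bar{x}_{obs}}$$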


Beta distribution: parameter estimation 


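The probability density function for a random variable x with a beta distribution is:

$$\text{Beta}(x; \alpha, \beta) = \frac{1}{B(\alpha, \beta)} x^{\alpha-1} (1-x)^{\beta-1}, \quad 0 \le x \le 1, \quad \alpha, \beta > 0$$

where $B(\alpha, \beta) = \Gamma(\alpha)\Gamma(\beta) / \Gamma(\alpha+\beta)$ is the beta function.


There is no closed-form solution for the parameters $\alpha$ and $\beta$, so the Newton-Raphson method with
step-halving will be used. The method requires the following information:


(1) The log likelihood function

$$L = W \ln\Gamma(\alpha+\beta) - W \ln\Gamma(\alpha) - W \ln\Gamma(\beta) + (\alpha-1) \sum_{i=1}^{n} w_i \ln x_i + (\beta-1) \sum_{i=1}^{n} w_i \ln(1-x_i)$$


(2) The gradient (1st derivative) vector with respect to $\alpha$ and $\beta$

$$g = \begin{bmatrix} \partial L / \partial\alpha \\ \partial L / \partial\beta \end{bmatrix} = \begin{bmatrix} W \psi(\alpha+\beta) - W \psi(\alpha) + \sum_{i=1}^{n} w_i \ln x_i \\ W \psi(\alpha+\beta) - W \psi(\beta) + \sum_{i=1}^{n} w_i \ln(1-x_i) \end{bmatrix}$$

where $\psi(a) = \Gamma'(a) / \Gamma(a)$ is a digamma function, which is the derivative of the logarithm of the gamma
function, evaluated at a.


(3) The Hessian (2nd derivative) matrix with respect to $\alpha$ and $\beta$ (since the Hessian matrix is
symmetric, only the lower triangular portion is displayed)

$$H = \begin{bmatrix} \frac{\partial^2 L}{\partial\alpha^2} & \\ \frac{\partial^2 L}{\partial\alpha \partial\beta} & \frac{\partial^2 L}{\partial\beta^2} \end{bmatrix} = \begin{bmatrix} W\left(\psi'(\alpha+\beta) - \psi'(\alpha)\right) & \\ W \psi'(\alpha+\beta) & W\left(\psi'(\alpha+\beta) - \psi'(\beta)\right) \end{bmatrix}$$

where $\psi'(a)$ is the trigamma function, or the derivative of the digamma function.


(4) The initial values of $\alpha$ and $\beta$ can be obtained from the closed-form estimates using the method
of moments:

$$\alpha^{(0)} = \bar{x}_{obs} \left( \frac{\bar{x}_{obs}\left(1 - \bar{x}_{obs}\right)}{s^2_{obs}} - 1 \right)$$

$$\beta^{(0)} = \left(1 - \bar{x}_{obs}\right) \left( \frac{\bar{x}_{obs}\left(1 - \bar{x}_{obs}\right)}{s^2_{obs}} - 1 \right)$$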


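Gamma distribution: parameter estimation


The probability density function for a random variable x with a gamma distribution is:

$$\text{Gamma}(x; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}, \quad \text{for } x > 0 \text{ and } \alpha, \beta > 0$$

If $\alpha$ is a positive integer, then the gamma function is given by: $\Gamma(\alpha) = (\alpha - 1)!$


There is no closed-form solution for the parameters $\alpha$ and $\beta$, so the Newton-Raphson method with
step-halving will be used. The method requires the following information:


(1) The log likelihood function

$$L = W \alpha \ln\beta - W \ln\Gamma(\alpha) + (\alpha-1) \sum_{i=1}^{n} w_i \ln x_i - \beta \sum_{i=1}^{n} w_i x_i$$


(2) The gradient (1st derivative) vector with respect to $\alpha$ and $\beta$

$$g = \begin{bmatrix} \partial L / \partial\alpha \\ \partial L / \partial\beta \end{bmatrix} = \begin{bmatrix} W \ln\beta - W \psi(\alpha) + \sum_{i=1}^{n} w_i \ln x_i \\ \frac{W \alpha}{\beta} - \sum_{i=1}^{n} w_i x_i \end{bmatrix}$$

where $\psi(a) = \Gamma'(a) / \Gamma(a)$ is a digamma function, which is the derivative of the logarithm of the gamma
function, evaluated at a.


(3) The Hessian (2nd derivative) matrix with respect to $\alpha$ and $\beta$ (since the Hessian matrix is
symmetric, only the lower triangular portion is displayed)

$$H = \begin{bmatrix} \frac{\partial^2 L}{\partial\alpha^2} & \\ \frac{\partial^2 L}{\partial\alpha \partial\beta} & \frac{\partial^2 L}{\partial\beta^2} \end{bmatrix} = \begin{bmatrix} -W \psi'(\alpha) & \\ \frac{W}{\beta} & -\frac{W \alpha}{\beta^2} \end{bmatrix}$$

where $\psi'(a)$ is the trigamma function, or the derivative of the digamma function.


(4) The initial values of $\alpha$ and $\beta$ can be obtained from the closed-form estimates using the method
of moments:

$$\alpha^{(0)} = \frac{\bar{x}_{obs}^2}{s^2_{obs}}$$

$$\beta^{(0)} = \frac{\bar{x}_{obs}}{s^2_{obs}}$$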


Weibull distribution: parameter estimation 


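Distribution fitting for the Weibull distribution is restricted to the two-parameter Weibull
distribution, whose probability density function is given by:

$$\text{Weib}(x; \beta, \gamma) = \frac{\gamma}{\beta} \left(\frac{x}{\beta}\right)^{\gamma-1} e^{-\left(\frac{x}{\beta}\right)^\gamma}, \quad \text{for } x > 0 \text{ and } \beta, \gamma > 0$$

There is no closed-form solution for the parameters $\beta$ and $\gamma$, so the Newton-Raphson method with
step-halving will be used. The method requires the following information:


(1) The log likelihood function

$$L = W\left(\ln\gamma - \gamma \ln\beta\right) + (\gamma - 1) \sum_{i=1}^{n} w_i \ln(x_i) - \sum_{i=1}^{n} w_i \left(\frac{x_i}{\beta}\right)^\gamma$$


(2) The gradient (1st derivative) vector with respect to $\beta$ and $\gamma$

$$g = \begin{bmatrix} \partial L / \partial\beta \\ \partial L / \partial\gamma \end{bmatrix} = \begin{bmatrix} -\frac{W \gamma}{\beta} + \frac{\gamma}{\beta} \sum_{i=1}^{n} w_i \left(\frac{x_i}{\beta}\right)^\gamma \\ \frac{W}{\gamma} - W \ln\beta + \sum_{i=1}^{n} w_i \ln(x_i) - \sum_{i=1}^{n} w_i \left(\frac{x_i}{\beta}\right)^\gamma \ln\left(\frac{x_i}{\beta}\right) \end{bmatrix}$$


(3) The Hessian (2nd derivative) matrix with respect to $\beta$ and $\gamma$ (since the Hessian matrix is
symmetric, only the lower triangular portion is displayed). Differentiating the gradient above gives:

$$\frac{\partial^2 L}{\partial\beta^2} = \frac{W \gamma}{\beta^2} - \frac{\gamma(\gamma+1)}{\beta^2} \sum_{i=1}^{n} w_i \left(\frac{x_i}{\beta}\right)^\gamma$$

$$\frac{\partial^2 L}{\partial\beta \partial\gamma} = -\frac{W}{\beta} + \frac{1}{\beta} \sum_{i=1}^{n} w_i \left(\frac{x_i}{\beta}\right)^\gamma + \frac{\gamma}{\beta} \sum_{i=1}^{n} w_i \left(\frac{x_i}{\beta}\right)^\gamma \ln\left(\frac{x_i}{\beta}\right)$$

$$\frac{\partial^2 L}{\partial\gamma^2} = -\frac{W}{\gamma^2} - \sum_{i=1}^{n} w_i \left(\frac{x_i}{\beta}\right)^\gamma \left(\ln\left(\frac{x_i}{\beta}\right)\right)^2$$


(4) The initial values of $\beta$ and $\gamma$ are given by: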


40) 4 
BO ~4 
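

When implementing a Newton-Raphson fit, it helps to check the analytic gradient against finite differences of the log likelihood. The sketch below does this for the Weibull case; the function names are hypothetical, and it covers only the log likelihood and gradient, not the iteration or its starting values.

```python
import math


def weibull_loglik(beta, gamma, x, w):
    """Frequency-weighted Weibull log likelihood (beta: scale, gamma: shape)."""
    total = sum(w)
    sum_log = sum(wi * math.log(xi) for xi, wi in zip(x, w))
    sum_pow = sum(wi * (xi / beta) ** gamma for xi, wi in zip(x, w))
    return (total * (math.log(gamma) - gamma * math.log(beta))
            + (gamma - 1.0) * sum_log - sum_pow)


def weibull_gradient(beta, gamma, x, w):
    """Return (dL/dbeta, dL/dgamma) for the log likelihood above."""
    total = sum(w)
    sum_log = sum(wi * math.log(xi) for xi, wi in zip(x, w))
    sum_pow = sum(wi * (xi / beta) ** gamma for xi, wi in zip(x, w))
    sum_pow_log = sum(wi * (xi / beta) ** gamma * math.log(xi / beta)
                      for xi, wi in zip(x, w))
    d_beta = -total * gamma / beta + gamma / beta * sum_pow
    d_gamma = total / gamma - total * math.log(beta) + sum_log - sum_pow_log
    return d_beta, d_gamma
```

Comparing `weibull_gradient` with a central difference of `weibull_loglik` at a few parameter values is a cheap way to catch sign or index errors before running the optimizer.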


Goodness of fit measures


Goodness of fit measures are used to determine the distribution that most closely fits the
data. For discrete distributions, the Chi-Square test is used. For continuous distributions, the
Anderson-Darling test or the Kolmogorov-Smirnov test is used.


Discrete distributions 


The Chi-Square goodness of fit test is used for discrete distributions (Kroese, 2011). The
Chi-Square test statistic has the following form:

$$T = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

where,

Table 94-2
Notation

Notation     Description
$k$          The number of classes, as defined in the table below for each discrete distribution
$O_i$        The total observed frequency for class i
PDF(i)       Probability density function of the fitted distribution. For the Poisson and negative
             binomial distributions, the density function for the last class is computed as
             $\text{PDF}(k) = 1 - \sum_{i=1}^{k-1} \text{PDF}(i)$
$E_i$        Expected frequency for class i: $E_i = W \cdot \text{PDF}(i)$
$W$          The total effective sample size


For large W, the above statistic follows the Chi-Square distribution:

$$T \sim \chi^2_{k-r-1}$$

where r = number of parameters estimated from the data. The following table provides the values
of k and r for the various distributions. The value Max in the table is the observed maximum value.


Distribution         Notation                               k (classes)   r (parameters)
Binomial             Bin($x; N, P$)                         N + 1         2
Categorical          Categorical($x; P_1, \ldots, P_J$)     J             J - 1
Poisson              Pois($x; \lambda$)                     Max + 1       1
Negative binomial    NB($x; r, \theta$)                     Max + 1       2


This Chi-Square test is valid only if all values of $E_i$ > 5.
The p-value for the Chi-Square test is then calculated as:

$$p = 1 - F\left(T; \chi^2_{k-r-1}\right)$$

where $F(T; \chi^2_{k-r-1})$ = CDF of the Chi-Square distribution.


Note: The p-value cannot be calculated for the Categorical distribution since the number of
degrees of freedom is zero.
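

A minimal sketch of the test statistic follows, using the Poisson case to show how the last class absorbs the remaining probability mass. The function names are hypothetical, and the p-value is omitted because it needs a chi-square distribution function.

```python
import math


def poisson_class_probabilities(lam, max_value):
    """Probabilities for classes 0..max_value; the last class takes the remainder."""
    probs = [math.exp(-lam) * lam ** i / math.factorial(i)
             for i in range(max_value)]
    probs.append(1.0 - sum(probs))  # PDF(k) = 1 - sum of the other classes
    return probs


def chi_square_gof(observed, probabilities, n_params):
    """Return (T, degrees of freedom, validity flag) for the chi-square test."""
    total = sum(observed)
    expected = [total * p for p in probabilities]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - n_params - 1
    valid = all(e > 5 for e in expected)  # validity condition on expected counts
    return statistic, df, valid
```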


Continuous distributions 


For continuous distributions, the Anderson-Darling test or the Kolmogorov-Smirnov test is used 
to determine goodness of fit. The calculation consists of the following steps: 


1. Transform the data to a Uniform(0,1) distribution 
2. Sort the transformed data to generate the Order Statistics 
3. Calculate the Anderson-Darling or Kolmogorov-Smirnov test statistic 


4. Compute the approximate p-value associated with the test statistic 


The first two steps are common to both the Anderson-Darling and Kolmogorov-Smirnov tests.


The original data are transformed to a Uniform(0,1) distribution using the transformation:

$$y = F(x)$$

where the transformation function $F(\cdot)$ is given in the table below for each of the supported
distributions.


Distribution                   Transformation F(x)
Triag($x; a, m, b$)            $\frac{(x-a)^2}{(b-a)(m-a)}$ for $x \in [a, m)$; $\frac{m-a}{b-a}$ for $x = m$;
                               $1 - \frac{(b-x)^2}{(b-a)(b-m)}$ for $x \in (m, b]$; 0 for $x \notin [a, b]$
U($x; a, b$)                   $\frac{x-a}{b-a}$
Nor($x; \mu, \sigma$)          $\Phi\left(\frac{x-\mu}{\sigma}\right)$
LN($x; \mu, \sigma$)           $\Phi\left(\frac{\ln x - \ln\mu}{\sigma}\right)$
Exp($x; \lambda$)              $1 - e^{-\lambda x}$
Beta($x; \alpha, \beta$)       $\frac{1}{B(\alpha, \beta)} \int_0^x t^{\alpha-1} (1-t)^{\beta-1} \, dt$
Gamma($x; \alpha, \beta$)      $\frac{\beta^\alpha}{\Gamma(\alpha)} \int_0^x t^{\alpha-1} e^{-\beta t} \, dt$
Weib($x; \beta, \gamma$)       $1 - e^{-(x/\beta)^\gamma}$


The transformed data points $y_i$ are sorted in ascending order to generate the Order Statistics:

$$y_{(1)} \le y_{(2)} \le \cdots \le y_{(n-1)} \le y_{(n)}$$

Define $w_{(i)}$ to be the corresponding frequency weight for $y_{(i)}$. The cumulative frequency up to and
including $y_{(i)}$ is defined as:

$$W_i = \sum_{j=1}^{i} w_{(j)}$$

and where we define $W_0 = 0$.


Anderson-Darling test


The Anderson-Darling test statistic is given by:

$$z = -W_n - \frac{1}{W_n} \left( \sum_{i=1}^{n} w_{(i)} \left(2 W_{i-1} + w_{(i)}\right) \ln\left(y_{(i)}\right) + \sum_{i=1}^{n} w_{(i)} \left(2\left(W_n - W_i\right) + w_{(i)}\right) \ln\left(1 - y_{(i)}\right) \right)$$

where $W_i$ is the cumulative frequency up to and including $y_{(i)}$ and $W_n$ is the total frequency.
For more information, see the topic “Anderson-Darling statistic with frequency weights”.


The approximate p-value for the Anderson-Darling statistic can be computed for the following
distributions: uniform, normal, lognormal, exponential, Weibull and gamma. The p-value is not
available for the triangular and beta distributions.


Uniform distribution: p-value 


The p-value for the Anderson-Darling statistic is computed based on the following result, provided
by Marsaglia (2004):

$$p = \begin{cases} 1 - z^{-1/2} e^{-1.2337141/z} \, g(z), & \text{for } z \in (0, 2) \\ 1 - \exp\left(-\exp\left(1.0776 - (2.30695 - (0.43424 - (0.082433 - (0.008056 - 0.0003146 z)z)z)z)z\right)\right), & \text{for } z \in [2, \infty) \end{cases}$$

where

$$g(z) = 2.00012 + (0.247105 - (0.0649821 - (0.0347962 - (0.0116720 - 0.00168691 z)z)z)z)z$$


Normal and lognormal distributions: p-value 


The p-value for the Anderson-Darling statistic is computed based on the following result, provided
by D’Agostino and Stephens (1986):

$$p = \begin{cases} 1 - \exp\left(-13.436 + 101.14 z^* - 223.73 z^{*2}\right), & z^* < 0.2 \\ 1 - \exp\left(-8.318 + 42.796 z^* - 59.938 z^{*2}\right), & 0.2 \le z^* < 0.34 \\ \exp\left(0.9177 - 4.279 z^* - 1.38 z^{*2}\right), & 0.34 \le z^* < 0.6 \\ \exp\left(1.2937 - 5.709 z^* + 0.0186 z^{*2}\right), & 0.6 \le z^* < 153.467 \\ 0, & z^* \ge 153.467 \end{cases}$$

where

$$z^* = z \left(1.0 + 0.75/W_n + 2.25/W_n^2\right)$$
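

The piecewise approximation translates directly into code. The sketch below takes the adjusted statistic $z^*$ as input; the function names are hypothetical, and the placement of the boundaries (strict versus non-strict inequalities) is an assumption.

```python
import math


def adjust_ad_normal(z, total_weight):
    """Adjusted Anderson-Darling statistic z* for the normal/lognormal case."""
    return z * (1.0 + 0.75 / total_weight + 2.25 / total_weight ** 2)


def ad_normal_p_value(z_star):
    """Approximate p-value for the adjusted statistic z*."""
    if z_star < 0.2:
        return 1.0 - math.exp(-13.436 + 101.14 * z_star - 223.73 * z_star ** 2)
    if z_star < 0.34:
        return 1.0 - math.exp(-8.318 + 42.796 * z_star - 59.938 * z_star ** 2)
    if z_star < 0.6:
        return math.exp(0.9177 - 4.279 * z_star - 1.38 * z_star ** 2)
    if z_star < 153.467:
        return math.exp(1.2937 - 5.709 * z_star + 0.0186 * z_star ** 2)
    return 0.0
```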


Exponential distribution: p-value


The p-value for the Anderson-Darling statistic is computed based on the following result, provided
by D’Agostino and Stephens (1986):

$$p = \begin{cases} 1 - \exp\left(-12.2204 + 67.459 z^* - 110.3 z^{*2}\right), & z^* < 0.260 \\ 1 - \exp\left(-6.1327 + 20.218 z^* - 18.663 z^{*2}\right), & 0.260 \le z^* < 0.510 \\ \exp\left(0.9209 - 3.353 z^* + 0.3 z^{*2}\right), & 0.510 \le z^* < 0.950 \\ \exp\left(0.731 - 3.009 z^* + 0.15 z^{*2}\right), & 0.950 \le z^* < 10.03 \\ 0, & z^* \ge 10.03 \end{cases}$$

where

$$z^* = z \left(1.0 + 0.6/W_n\right)$$


Weibull distribution: p-value 


The p-value for the Anderson-Darling statistic is computed based on Table 94-3 below, provided by
D’Agostino and Stephens (1986). First, the adjusted Anderson-Darling statistic is computed from:

$$z^* = z \left(1 + 0.2/\sqrt{W_n}\right)$$

If the value of $z^*$ is between two probability levels (in the table), then linear interpolation is used
to estimate the p-value. For example, if $z^* = 0.543$, which is between $z_1^* = 0.474$ and $z_2^* = 0.637$,
then the corresponding probabilities of $z_1^*$ and $z_2^*$ are $p_1 = 0.25$ and $p_2 = 0.1$ respectively. Then
the p-value of $z^*$ is computed as

$$p = \frac{p_2 - p_1}{z_2^* - z_1^*} \left(z^* - z_1^*\right) + p_1 = \frac{0.1 - 0.25}{0.637 - 0.474} (0.543 - 0.474) + 0.25 = 0.1865$$

If the value of $z^*$ is less than the smallest critical value in the table, then the p-value is > 0.25; and
if $z^*$ is greater than the largest critical value in the table, then the p-value is < 0.01.


Table 94-3
Upper tail probability and corresponding critical values for the Anderson-Darling test, for the Weibull
distribution

p-value                      0.25     0.10     0.05     0.025    0.01
$z(1 + 0.2/\sqrt{W_n})$      0.474    0.637    0.757    0.877    1.038
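

The table lookup with linear interpolation can be sketched as follows. The names `WEIBULL_AD_TABLE` and `interpolate_p_value` are hypothetical; the function returns a relation marker together with the value so that the out-of-range cases ("> 0.25", "< 0.01") stay distinguishable from interpolated p-values.

```python
# (critical value, upper tail probability) pairs from Table 94-3.
WEIBULL_AD_TABLE = [
    (0.474, 0.25), (0.637, 0.10), (0.757, 0.05), (0.877, 0.025), (1.038, 0.01),
]


def interpolate_p_value(statistic, table):
    """Return (relation, p) where relation is '>', '<' or '='."""
    if statistic < table[0][0]:
        return ">", table[0][1]    # below the smallest critical value
    if statistic > table[-1][0]:
        return "<", table[-1][1]   # above the largest critical value
    for (z1, p1), (z2, p2) in zip(table, table[1:]):
        if z1 <= statistic <= z2:
            return "=", (p2 - p1) / (z2 - z1) * (statistic - z1) + p1
    return "=", table[-1][1]       # single-row table: statistic equals its entry
```

The same function applies to the other critical-value tables in this section by passing a different list of pairs.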


Gamma distribution: p-value 


Table 94-4, which is provided by D’Agostino and Stephens (1986), is used to compute the p-value
of the Anderson-Darling test for the gamma distribution. First, the appropriate row in the table
is determined from the range of the parameter $\alpha$. Then linear interpolation is used to compute
the p-value, as done for the Weibull distribution. For more information, see the topic “Weibull
distribution: p-value”.


If the test statistic is less than the smallest critical value in the row, then the p-value is > 0.25; and
if the test statistic is greater than the largest critical value in the row, then the p-value is < 0.005.


Table 94-4
Upper tail probability and corresponding critical values for the Anderson-Darling test, for the gamma
distribution with estimated parameter $\alpha$

p-value                    0.25     0.10     0.05     0.025    0.01     0.005
$\alpha \in (0, 1]$        0.486    0.657    0.786    0.917    1.092    1.227
$\alpha \in (1, 8]$        0.473    0.637    0.759    0.883    1.048    1.173
$\alpha \in (8, \infty)$   0.470    0.631    0.752    0.873    1.035    1.159


Kolmogorov-Smirnov test 


The Kolmogorov-Smirnov test statistic, $D_n$, is given by:

$$D^+ = \max_i \left| y_{(i)} - \frac{W_i}{W_n} \right|, \qquad D^- = \max_i \left| y_{(i)} - \frac{W_{i-1}}{W_n} \right|$$

$$D_n = \max\left(D^+, D^-\right)$$

Computation of the p-value is based on the modified Kolmogorov-Smirnov statistic, which is
distribution specific.


Uniform distribution: p-value 


The procedure proposed by Kroese (2011) is used to compute the p-value of the
Kolmogorov-Smirnov statistic for the uniform distribution. First, the modified
Kolmogorov-Smirnov statistic is computed as

$$D = \sqrt{W_n} \, D_n$$

The corresponding p-value is computed as follows:


1. Set k = 100
2. Define $a_k = \sum_{i=-k}^{k} (-1)^i e^{-2 i^2 D^2}$
3. Calculate $a_k$ and $a_{k-1}$

4. If Jay, — ag. | > 107° set k=k+1 and repeat step 2; otherwise, go to step 5. 


5. p-value = $1 - a_k$
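

The iterative procedure can be sketched as below, assuming the partial sums take the form $a_k = \sum_{i=-k}^{k} (-1)^i e^{-2 i^2 D^2}$. The function name, the default tolerance `tol` and the safety cap `max_k` are assumptions made for illustration; they are not values taken from the procedure described above.

```python
import math


def kolmogorov_p_value(d, tol=1e-10, start_k=100, max_k=10000):
    """Approximate p-value for the modified Kolmogorov-Smirnov statistic d > 0."""
    def partial_sum(k):
        return sum((-1) ** abs(i) * math.exp(-2.0 * i * i * d * d)
                   for i in range(-k, k + 1))

    k = start_k
    a_prev = partial_sum(k - 1)
    a_k = partial_sum(k)
    # Add terms until two successive partial sums agree within the tolerance.
    while abs(a_k - a_prev) > tol and k < max_k:
        k += 1
        a_prev, a_k = a_k, partial_sum(k)
    return 1.0 - a_k
```

For moderate values of D the terms decay so quickly that the loop exits immediately at k = 100.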


Normal and lognormal distributions: p-value 


The modified Kolmogorov-Smirnov statistic is

$$D = D_n \left( \sqrt{W_n} - 0.01 + \frac{0.85}{\sqrt{W_n}} \right)$$

The p-value for the Kolmogorov-Smirnov statistic is computed based on Table 94-5 below,
provided by D’Agostino and Stephens (1986). If the value of D is between two probability levels,
then linear interpolation is used to estimate the p-value. For more information, see the topic
“Weibull distribution: p-value”.


If D is less than the smallest critical value in the table, then the p-value is > 0.15; and if D is
greater than the largest critical value in the table, then the p-value is < 0.01.


Table 94-5
Upper tail probability and corresponding critical values for the Kolmogorov-Smirnov test, for the
Normal and Lognormal distributions

p-value    0.15     0.10     0.05     0.025    0.01
D          0.775    0.819    0.895    0.995    1.035


Exponential distribution: p-value 


The modified Kolmogorov-Smirnov statistic is

$$D = \left(D_n - 0.2/W_n\right) \left(\sqrt{W_n} + 0.26 + 0.5/\sqrt{W_n}\right)$$

The p-value for the Kolmogorov-Smirnov statistic is computed based on Table 94-6 below,
provided by D’Agostino and Stephens (1986). If the value of D is between two probability levels,
then linear interpolation is used to estimate the p-value. For more information, see the topic
“Weibull distribution: p-value”.


If D is less than the smallest critical value in the table, then the p-value is > 0.15; and if D is
greater than the largest critical value in the table, then the p-value is < 0.01.


Table 94-6
Upper tail probability and corresponding critical values for the Kolmogorov-Smirnov test, for the
Exponential distribution

p-value    0.15     0.10     0.05     0.025    0.01
D          0.926    0.995    1.094    1.184    1.298


Weibull distribution: p-value 


The modified Kolmogorov-Smirnov statistic is 
$$D = \sqrt{W^*}\, D_n$$


The p-value for the Kolmogorov-Smirnov statistic is computed based on Table 94-7 below, 
provided by D’Agostino and Stephens (1986). If the value of D is between two probability levels, 
then linear interpolation is used to estimate the p-value. For more information, see the topic 
“Weibull distribution: p-value”. 

If D is less than the smallest critical value in the table, then the p-value is > 0.10; and if D is 
greater than the largest critical value in the table, then the p-value is < 0.01. 


Table 94-7 

Upper tail probability and corresponding critical values for the Kolmogorov-Smirnov test, for the 
Weibull distribution 

p-value 0.10 0.05 0.025 0.01 

D 1.372 1.477 1.557 1.671 
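
The table lookup with linear interpolation, which all of the table-based p-value sections share, can be sketched as below. The function name and the `(value, relation)` return convention are assumptions made for illustration; the critical values are those of Table 94-7.

```python
def table_pvalue(d, crit_values, probs):
    """Look up a p-value for the modified statistic d.

    Returns (p, relation): relation is ">" or "<" when d falls outside
    the table (p is then a bound), and "=" when p is interpolated.
    """
    if d < crit_values[0]:
        return probs[0], ">"
    if d > crit_values[-1]:
        return probs[-1], "<"
    pairs = list(zip(crit_values, probs))
    for (d0, p0), (d1, p1) in zip(pairs, pairs[1:]):
        if d0 <= d <= d1:
            # Linear interpolation between adjacent probability levels.
            frac = (d - d0) / (d1 - d0)
            return p0 + frac * (p1 - p0), "="


# Table 94-7 (Weibull distribution)
WEIBULL_D = [1.372, 1.477, 1.557, 1.671]
WEIBULL_P = [0.10, 0.05, 0.025, 0.01]
```

A statistic halfway between the first two critical values, for example, is assigned a p-value halfway between 0.10 and 0.05.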


Gamma distribution: p-value 


The modified Kolmogorov-Smirnov statistic is 
$$D = D_n \left( \sqrt{W^*} + \frac{0.3}{\sqrt{W^*}} \right)$$


The p-value for the Kolmogorov-Smirnov statistic is computed based on Table 94-8 below, 
provided by D’Agostino and Stephens (1986). If the value of D is between two probability levels, 
then linear interpolation is used to estimate the p-value. For more information, see the topic 
“Weibull distribution: p-value”. 


If D is less than the smallest critical value in the table, then the p-value is > 0.25; and if D is 
greater than the largest critical value in the table, then the p-value is < 0.005. 


Table 94-8 


Upper tail probability and corresponding critical values for the Kolmogorov-Smirnov test, for the 
Gamma distribution 


p-value   0.25    0.20    0.15    0.10    0.05    0.025   0.01    0.005 
D         0.74    0.780   0.800   0.858   0.928   0.990   1.069   1.13 


Determining the recommended distribution 


The distribution fitting module is invoked by the user, who may specify an explicit set of 
distributions to test or rely on the default set, which is determined from the measurement level 

of the input to be fit. For continuous inputs, the user specifies either the Anderson-Darling test 
(the default) or the Kolmogorov-Smirnov test for the goodness of fit measure (for ordinal and 
nominal inputs, the Chi-Square test is always used). The distribution fitting module then returns 
the values of the specified test statistic along with the calculated p-values (if available) for each of 
the tested distributions, which are then presented to the user in ascending order of the test statistic. 
The recommended distribution is the one with the minimum value of the test statistic. 


The above approach yields the distribution that most closely fits the data. However, if the p-value 
of the recommended distribution is less than 0.05, then the recommended distribution may not 
provide a close fit to the data. 


Simulation algorithms 


Anderson-Darling statistic with frequency weights 


To obtain the expression for the Anderson-Darling statistic with frequency weights, we first give 
the expression where the frequency weight of each value is 1: 


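$$
\begin{aligned}
A^2 &= -n - \frac{1}{n}\sum_{i=1}^{n} (2i-1)\left[\ln\left(y_{(i)}\,(1 - y_{(n+1-i)})\right)\right] \\
&= -n - \frac{1}{n}\sum_{i=1}^{n} (2i-1)\left[\ln(y_{(i)}) + \ln(1 - y_{(n+1-i)})\right] \\
&= -n - \frac{1}{n}\sum_{i=1}^{n} (2i-1)\ln(y_{(i)}) - \frac{1}{n}\sum_{i=1}^{n} \left(2(n+1-i)-1\right)\ln(1-y_{(i)}) \\
&= -n - \frac{1}{n}\sum_{i=1}^{n} (2i-1)\ln(y_{(i)}) - 2\sum_{i=1}^{n}\left(1-\frac{i}{n}\right)\ln(1-y_{(i)}) - \frac{1}{n}\sum_{i=1}^{n}\ln(1-y_{(i)}) \\
&= A + B + C + D
\end{aligned}
$$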


If there is a frequency weight variable, then the corresponding four terms of the above expression 
are given by: 


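$$A = -W^*$$

$$
\begin{aligned}
B &= -\frac{1}{W^*}\sum_{i}\sum_{j=1}^{w_i}\left(2(W_{i-1}+j)-1\right)\ln(y_{(i)}) \\
&= -\frac{1}{W^*}\sum_i w_i(2W_{i-1}-1)\ln(y_{(i)}) - \frac{1}{W^*}\sum_i w_i(w_i+1)\ln(y_{(i)}) \\
&= -\frac{1}{W^*}\sum_i w_i(2W_{i-1}+w_i)\ln(y_{(i)})
\end{aligned}
$$

$$
\begin{aligned}
C &= -\frac{2}{W^*}\sum_i\sum_{j=1}^{w_i}\left(W^* - (W_{i-1}+j)\right)\ln(1-y_{(i)}) \\
&= -2\sum_i w_i\left(1-\frac{W_{i-1}}{W^*}\right)\ln(1-y_{(i)}) + \frac{1}{W^*}\sum_i w_i(w_i+1)\ln(1-y_{(i)})
\end{aligned}
$$

$$D = -\frac{1}{W^*}\sum_i w_i \ln(1-y_{(i)})$$

$$C + D = \frac{1}{W^*}\sum_i w_i(2W_{i-1}+w_i)\ln(1-y_{(i)}) - 2\sum_i w_i\ln(1-y_{(i)})$$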


where w; and W* are defined in the section on goodness of fit measures for continuous 
distributions. For more information, see the topic “Continuous distributions”. 
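
As a numerical check of the weighted expression, the sketch below computes the statistic two ways: directly from the unit-weight formula on a data set in which each value is repeated according to its frequency, and from the weighted terms. It assumes that $W_{i-1}$ is the cumulative weight of all distinct values preceding the i-th one (with $W_0 = 0$); the function names are illustrative.

```python
import math


def ad_unweighted(y):
    """A^2 = -n - (1/n) sum (2i-1)[ln y_(i) + ln(1 - y_(n+1-i))]."""
    y = sorted(y)
    n = len(y)
    s = sum((2 * i - 1) * (math.log(y[i - 1]) + math.log(1.0 - y[n - i]))
            for i in range(1, n + 1))
    return -n - s / n


def ad_weighted(values, weights):
    """Same statistic from distinct values and frequency weights."""
    pairs = sorted(zip(values, weights))
    w_star = sum(w for _, w in pairs)
    cum_prev = 0          # W_{i-1}: cumulative weight before value i
    b = 0.0               # term B
    c_plus_d = 0.0        # terms C + D combined
    for y, w in pairs:
        k = w * (2 * cum_prev + w)
        b -= k * math.log(y) / w_star
        c_plus_d += k * math.log(1.0 - y) / w_star - 2 * w * math.log(1.0 - y)
        cum_prev += w
    return -w_star + b + c_plus_d  # A + B + (C + D)
```

Both functions should agree up to floating-point rounding, since a frequency weight of w is equivalent to w identical records.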

Simulation algorithms: run simulation 


Running a simulation involves generating data for each of the simulated inputs, evaluating the 
predictive model based on the simulated data (along with values for any fixed inputs), and 
calculating metrics based on the model results. 


Generating correlated data 


Simulated values of input variables are generated so as to account for any correlations between 
pairs of variables. This is accomplished using the NORTA (Normal-To-Anything) method 
described by Biller and Ghosh (2006). The central idea is to transform standard multivariate 
normal variables to variables with the desired marginal distributions and Pearson correlation 
matrix. 


Suppose that the desired variables are $X_i$, $i = 1, \ldots, k$, with the desired Pearson correlation 
matrix $\Sigma_X$, where the elements of $\Sigma_X$ are given by $\rho_{ij}$. Then the NORTA algorithm is as follows: 

1. For each pair $X_i$ and $X_j$, where $i < j$, use a stochastic root finding algorithm (described in the 
following section) and the correlation $\rho_{ij}$ to search for an approximate correlation $\rho^*_{ij}$ of standard 
bivariate normal variables. 

2. Construct the symmetric matrix $\Sigma_Z$, whose elements are given by $\rho^*_{ij}$, where $\rho^*_{ii} = 1$ and $\rho^*_{ij} = \rho^*_{ji}$. 

3. Generate the standard multivariate normal variables $Z_1, \ldots, Z_k$ with Pearson correlation matrix $\Sigma_Z$. 

4. Transform the variables $Z_1, \ldots, Z_k$ to $X_1, \ldots, X_k$ using 

$$X_i = F_i^{-1}(\Phi(Z_i)), \quad i = 1, \ldots, k$$

where $F_i$ is the desired marginal cumulative distribution, and $\Phi(\cdot)$ is the cumulative standard 
normal distribution function. Then the correlation matrix of $X_1, \ldots, X_k$ will be close to the 
desired Pearson correlation matrix $\Sigma_X$. 
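
Steps 3 and 4 can be sketched as follows, assuming the normal-space correlation matrix has already been found. The helper names are illustrative, and the use of a Cholesky factor to generate the correlated normal variables is an assumption of this sketch; the source does not state how the multivariate normal values are generated.

```python
import math
import random
from statistics import NormalDist


def cholesky(a):
    """Lower-triangular factor L with L * L^T = a (a positive definite)."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                l[i][j] = math.sqrt(a[i][i] - s)
            else:
                l[i][j] = (a[i][j] - s) / l[j][j]
    return l


def norta_sample(sigma_z, inverse_cdfs, n, rng):
    """Draw n records: correlated normals Z, then X_i = F_i^{-1}(Phi(Z_i))."""
    l = cholesky(sigma_z)
    phi = NormalDist().cdf
    k = len(inverse_cdfs)
    rows = []
    for _ in range(n):
        e = [rng.gauss(0.0, 1.0) for _ in range(k)]
        z = [sum(l[i][m] * e[m] for m in range(i + 1)) for i in range(k)]
        rows.append([f(phi(zi)) for f, zi in zip(inverse_cdfs, z)])
    return rows


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

For example, passing the inverse CDFs of an exponential and a uniform distribution yields two positively correlated columns whose marginals keep their intended shapes.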


Stochastic root finding algorithm 


Given a correlation $\rho_{ij}$, a stochastic root finding algorithm is used to find an approximate 
correlation $\rho^*_{ij}$ such that if standard bivariate normal variables $Z_i$ and $Z_j$ have the Pearson 
correlation $\rho^*_{ij}$, then after transforming $Z_i$ and $Z_j$ to $X_i$ and $X_j$ (using the transformation described 
in Step 4 of the previous section) the Pearson correlation between $X_i$ and $X_j$ is close to $\rho_{ij}$. The 
stochastic root finding algorithm is as follows: 

1. Let LowCorr = −1 and HighCorr = 1. 

2. Simulate N samples of standard normal variables $Z_i^{(L)}$, $Z_j^{(L)}$, $Z_i^{(H)}$ and $Z_j^{(H)}$ such that the 
Pearson correlation between $Z_i^{(L)}$ and $Z_j^{(L)}$ is LowCorr and the Pearson correlation between 
$Z_i^{(H)}$ and $Z_j^{(H)}$ is HighCorr. The sample size N is set to 1000. 

3. Transform the variables $Z_i^{(L)}$, $Z_j^{(L)}$, $Z_i^{(H)}$ and $Z_j^{(H)}$ to the variables $X_i^{(L)}$, $X_j^{(L)}$, $X_i^{(H)}$ and 
$X_j^{(H)}$ using the transformation described in Step 4 of the previous section. 

4. Compute the Pearson correlation between $X_i^{(L)}$ and $X_j^{(L)}$ and denote it as $\rho_{ij}^{L}$. Similarly, compute 
the Pearson correlation between $X_i^{(H)}$ and $X_j^{(H)}$ and denote it as $\rho_{ij}^{H}$. 

5. If the desired correlation $\rho_{ij} \notin [\rho_{ij}^{L}, \rho_{ij}^{H}]$ then stop and set $\rho^*_{ij}$ = LowCorr if $\rho_{ij} < \rho_{ij}^{L}$, or set 
$\rho^*_{ij}$ = HighCorr if $\rho_{ij} > \rho_{ij}^{H}$. Otherwise go to Step 6. 

6. Simulate N samples of standard bivariate normal variables $Z_i^{(M)}$ and $Z_j^{(M)}$ with a Pearson 
correlation of MidCorr = ½(LowCorr + HighCorr). As in Steps 3 and 4, transform $Z_i^{(M)}$ and 
$Z_j^{(M)}$ to $X_i^{(M)}$ and $X_j^{(M)}$ and compute the Pearson correlation between $X_i^{(M)}$ and $X_j^{(M)}$, which 
will be denoted $\rho_{ij}^{M}$. 

7. If $|\rho_{ij} - \rho_{ij}^{M}| < \epsilon$ or |HighCorr − LowCorr| < $\epsilon$, where $\epsilon$ is the tolerance level (set to 0.01), then 
stop and set $\rho^*_{ij}$ = MidCorr. Otherwise go to Step 8. 

8. If $\rho_{ij} > \rho_{ij}^{M}$, set LowCorr = MidCorr, else set HighCorr = MidCorr and return to Step 6. 
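
The eight steps translate into a short bisection loop. The sketch below is illustrative only: the function names are hypothetical, and the way the correlated normal pair is generated (a linear combination of two independent standard normals) is an assumption of the sketch.

```python
import math
import random
from statistics import NormalDist

_PHI = NormalDist().cdf


def _pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def _transformed_corr(rho, inv_i, inv_j, n, rng):
    """Simulate n normal pairs with correlation rho, transform them with
    X = F^{-1}(Phi(Z)), and return the Pearson correlation of the X's."""
    xs, ys = [], []
    for _ in range(n):
        zi = rng.gauss(0.0, 1.0)
        zj = rho * zi + math.sqrt(max(0.0, 1.0 - rho * rho)) * rng.gauss(0.0, 1.0)
        xs.append(inv_i(_PHI(zi)))
        ys.append(inv_j(_PHI(zj)))
    return _pearson(xs, ys)


def find_normal_corr(target, inv_i, inv_j, n=1000, eps=0.01, rng=None):
    """Search for the normal-space correlation that reproduces `target`."""
    rng = rng or random.Random()
    low, high = -1.0, 1.0                                   # Step 1
    rho_l = _transformed_corr(low, inv_i, inv_j, n, rng)    # Steps 2-4
    rho_h = _transformed_corr(high, inv_i, inv_j, n, rng)
    if target < rho_l:                                      # Step 5
        return low
    if target > rho_h:
        return high
    while True:
        mid = 0.5 * (low + high)                            # Step 6
        rho_m = _transformed_corr(mid, inv_i, inv_j, n, rng)
        if abs(target - rho_m) < eps or abs(high - low) < eps:  # Step 7
            return mid
        if target > rho_m:                                  # Step 8
            low = mid
        else:
            high = mid
```

Because each evaluation uses a fresh random sample, the result varies slightly from run to run; the interval-width test in Step 7 guarantees termination after a handful of halvings.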


Inverse CDF for binomial, Poisson and negative binomial distributions 


Use of the NORTA method for generating correlated data requires the inverse cumulative 
distribution function for each desired marginal distribution. This section describes the method for 
computing the inverse CDF for the binomial, Poisson and negative binomial distributions. Two 
parameterizations of the negative binomial distribution are supported. The first parameterization 
describes the distribution of the number of trials before the rth success, whereas the second 
parameterization describes the distribution of the number of failures before the rth success. 

The choice of method for determining the inverse CDF depends on the mean $\mu$ of the distribution. If 
$\mu$ > Threshold, where Threshold is set to 20, the following approximate normal method will be 
used to compute the inverse CDF for the binomial distribution, the Poisson distribution and the 
second parameterization of the negative binomial distribution. 

$$X = [F^{-1}(\Phi(Z))] = [\sigma(\Phi^{-1}(\Phi(Z))) + \mu] = [\sigma Z + \mu]$$

For the first parameterization of the negative binomial distribution, the formula is as follows: 

$$X = [\sigma Z + \mu] + r$$

The parameters $\mu$ and $\sigma$ are given by: 

■ Binomial distribution. $\mu = NP$ and $\sigma = \sqrt{NP(1 - P)}$, where N is the number of trials and P 
is the probability of success. 

■ Poisson distribution. $\mu = \lambda$ and $\sigma = \sqrt{\lambda}$, where $\lambda$ is the rate parameter. 

■ Negative binomial distribution (both parameterizations). $\mu = r(1-\theta)/\theta$ and $\sigma = \sqrt{r(1-\theta)}/\theta$, where r is 
the specified number of successes and $\theta$ is the probability of success. 

The notation [x] used above denotes the integer part of x. 

If $\mu$ < Threshold then the bisection method will be used. 
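
The approximate normal method amounts to a few lines of arithmetic. In the sketch below the function names are hypothetical, and the integer part is taken with `math.floor`, which is an assumption (the text does not say how negative values are handled).

```python
import math


def binomial_params(n_trials, p):
    """mu = NP, sigma = sqrt(NP(1 - P))."""
    return n_trials * p, math.sqrt(n_trials * p * (1.0 - p))


def poisson_params(lam):
    """mu = lambda, sigma = sqrt(lambda)."""
    return lam, math.sqrt(lam)


def negbin_params(r, theta):
    """mu = r(1 - theta)/theta, sigma = sqrt(r(1 - theta))/theta."""
    return r * (1.0 - theta) / theta, math.sqrt(r * (1.0 - theta)) / theta


def approx_inverse_cdf(z, mu, sigma, r=0):
    """X = [sigma * Z + mu] (+ r for the first negative binomial
    parameterization). Integer part taken with floor (assumption)."""
    return math.floor(sigma * z + mu) + r
```

With a Poisson rate of 25, for instance, a standard normal value of 1 maps to 25 + 5 = 30.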


Suppose that x and z are the values of X and Z respectively, where X is a random variable with a 
binomial, Poisson or negative binomial distribution, and Z is a random variable with the standard 
normal distribution. The objective function $f(x)$ to be used in the bisection search method is 
as follows: 

■ Binomial distribution. $f(x) = 1 - \Pr(B(x + 1, N - x) < P) - \Phi(z)$ 

■ Poisson distribution. $f(x) = 1 - \Pr(G(x + 1, 1) < \lambda) - \Phi(z)$ 

■ Negative binomial distribution (second parameterization). $f(x) = \Pr(B(r, x + 1) < \theta) - \Phi(z)$ 

where $B(\alpha, \beta)$ and $G(\alpha, \beta)$ are random variables with the beta distribution and gamma 
distribution, respectively, with parameters $\alpha$ and $\beta$. 


The bisection method is as follows: 


1. If $f(\mu) = 0$ then stop and set $x = [\mu + 0.5]$. Otherwise go to step 2 to determine two values 
$x_1$ and $x_2$ such that $f(x_1) \times f(x_2) < 0$. 

2. If $f(\mu) > 0$ then let $x_1 = 0$ and $x_2 = \mu$. If $f(\mu) < 0$ then let $x_1 = 2^{i-1} \times \mu$ and $x_2 = 2^{i} \times \mu$, 
where i is the minimum integer such that $f(x_1) \times f(x_2) < 0$. 

3. Let $m = \frac{1}{2}(x_1 + x_2)$. If $|f(m)| < \epsilon$ or $|x_1 - x_2| < 1$ where $\epsilon$ is a tolerance level, which is set to 
10~°, then stop and set $x = [m + 0.5]$. Otherwise go to step 4. 

4. If $f(m) > 0$, let $x_2 = m$, else let $x_1 = m$ and return to step 3. 


Note: The inverse CDF for the first parameterization of the negative binomial distribution is 
determined by taking the inverse CDF for the second parameterization and adding the distribution 
parameter r, where r is the specified number of successes. 
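
A sketch of the bisection search for the Poisson case follows. Several details are assumptions made for illustration: the function names, the tolerance value `eps` (the value printed in the source is not legible), the cap on the doubling search, the use of `math.floor` for the integer part, and the power-series evaluation of the gamma probability $\Pr(G(x+1, 1) < \lambda)$.

```python
import math
from statistics import NormalDist


def reg_lower_gamma(a, x):
    """Pr(G(a, 1) < x): regularized lower incomplete gamma, power series."""
    if x <= 0.0:
        return 0.0
    term = math.exp(a * math.log(x) - x - math.lgamma(a + 1.0))
    total = term
    n = 1
    while term > 1e-17 * total and n < 10000:
        term *= x / (a + n)
        total += term
        n += 1
    return total


def poisson_inverse_cdf(z, lam, eps=1e-6):
    """Bisection search for x with f(x) = 0; eps is an assumed tolerance."""
    target = NormalDist().cdf(z)

    def f(x):
        return 1.0 - reg_lower_gamma(x + 1.0, lam) - target

    mu = lam
    if f(mu) == 0.0:                                    # Step 1
        return math.floor(mu + 0.5)
    if f(mu) > 0.0:                                     # Step 2
        x1, x2 = 0.0, mu
    else:
        i = 1
        while f(2 ** (i - 1) * mu) * f(2 ** i * mu) >= 0.0 and i < 60:
            i += 1
        x1, x2 = 2 ** (i - 1) * mu, 2 ** i * mu
    while True:
        m = 0.5 * (x1 + x2)                             # Step 3
        if abs(f(m)) < eps or abs(x1 - x2) < 1.0:
            return math.floor(m + 0.5)
        if f(m) > 0.0:                                  # Step 4
            x2 = m
        else:
            x1 = m
```

Because the search stops once the bracket is narrower than 1 and then rounds, the returned value can differ by one from the exact quantile of the discrete distribution.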


Sensitivity measures 


Sensitivity measures provide information on the relationship between the values of a target and 
the values of the simulated inputs that give rise to the target. The following sensitivity measures 
are supported (and rendered as Tornado charts in the output of the simulation): 

■ Correlation. Measures the Pearson correlation between a target and a simulated input. 

■ One-at-a-time measure. Measures the effect on the target of modulating a simulated input 
by plus or minus a specified number of standard deviations of the input. 

■ Contribution to variance. Measures the contribution to the variance of the target from 
a simulated input. 


Notation 


The following notation is used throughout this section unless otherwise stated: 


Table 94-9 
Notation 

Notation    Description 
n           Number of records of simulated data 
X           An n × p matrix of values of the inputs to the predictive model. The 
            rows $x_i = (x_{i1}, \ldots, x_{ip})$, $i = 1, \ldots, n$ contain the values of the inputs 
            for each simulated record, excluding the target value. The columns 
            $x_j = (x_{1j}, \ldots, x_{nj})$, $j = 1, \ldots, p$ represent the set of inputs. 
Y           An n × 1 vector of values of the target variable, consisting of $y_i$, $i = 1, \ldots, n$ 
F(X)        A known model which can generate Y from X 
$s_{x_j}$   The value of a sensitivity measure for the input $x_j$ 


Correlation measure 


The correlation measure is the Pearson correlation coefficient between the values of a target 
and one of its simulated predictors. The correlation measure is not supported for targets with a 
nominal measurement level or for simulated inputs with a categorical distribution. For more 
information, see the topic “Pearson Correlation”. 


One-at-a-time measure 


The one-at-a-time measure is the change in the target due to modulating a simulated input by plus 
or minus a specified number of standard deviations of the distribution associated with the input. 
The one-at-a-time measure is not supported for targets with an ordinal or nominal measurement 
level, or for simulated inputs with any of the following distributions: categorical, Bernoulli, 
binomial, Poisson, or negative binomial. 


The procedure is to modulate the values of a simulated input by the specified number of standard 
deviations and recompute the target with the modulated values, without changing the values of 
the other inputs. The mean change in the target is then taken to be the value of the one-at-a-time 
sensitivity measure for that input. 


For each simulated input $x_j$ for which the one-at-a-time measure is supported: 

1. Define the temporary data matrix $X' = X$. 

2. Add the specified number of standard deviations of the input’s distribution to each value of 
$x_j$ in $X'$. 

3. Calculate $Y' = F(X')$. 

4. Calculate $s_{x_j} = \frac{1}{n}\sum_{i=1}^{n} (y'_i - y_i)$. 

5. Repeat Step 2, but now subtracting the specified number of standard deviations from each value of 
$x_j$. Continue with Steps 3 and 4 to obtain the value of $s_{x_j}$ in this case. 
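
The procedure can be sketched as below, with the data matrix held as a list of rows and the predictive model passed in as a plain function of one row. The function name and the `shift` argument (the specified number of standard deviations multiplied by the standard deviation of the input's distribution, negative for the "minus" case) are illustrative assumptions.

```python
def one_at_a_time(X, model, j, shift):
    """Mean change in the target when input column j is shifted by `shift`."""
    y = [model(row) for row in X]
    X_mod = [list(row) for row in X]        # Step 1: temporary copy X'
    for row in X_mod:
        row[j] += shift                     # Step 2: modulate column j only
    y_mod = [model(row) for row in X_mod]   # Step 3: Y' = F(X')
    # Step 4: mean of (y'_i - y_i)
    return sum(b - a for a, b in zip(y, y_mod)) / len(X)
```

For a linear model such as y = 2·x₀ + 3·x₁, shifting x₀ by +1.5 changes every prediction by exactly 3, so the measure is 3; shifting by −1.5 gives −3.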

Contribution to variance measure 


The contribution to variance measure uses the method of Sobol (2001) to calculate the total 
contribution to the variance of a target due to a simulated input. The total contribution to variance, 
as defined by Sobol, automatically includes interaction effects between the input of interest 

and the other inputs in the predictive model. 


The contribution to variance measure is not supported for targets with an ordinal or nominal 
measurement level, or for simulated inputs with any of the following distributions: categorical, 
Bernoulli, binomial, Poisson, or negative binomial. 


Let $X'$ be an additional set of simulated data, in the same form as X and with the same number 
of simulated records. 

Define the following: 

$$f_0 = \frac{1}{n}\sum_{i=1}^{n} y_i$$

$$D = \frac{1}{n}\sum_{i=1}^{n} y_i^2 - (f_0)^2$$

$$D_{X \setminus x_j} = \frac{1}{n}\sum_{i=1}^{n} y_i \left[F\left(X'_{x_j} + X_{X \setminus x_j}\right)\right]_i - (f_0)^2$$

where 

■ $X \setminus x_j$ denotes the set of all inputs excluding $x_j$ 

■ $X'_{x_j} + X_{X \setminus x_j}$ is a derived data matrix where the column associated with $x_j$ is taken from 
$X'$ and the remaining columns (for all inputs excluding $x_j$) are taken from X 

The total contribution to variance from $x_j$ is then given by 

$$s_{x_j} = \frac{D - D_{X \setminus x_j}}{D}$$

Note: When interaction terms are present, the sum of the $s_{x_j}$ over all simulated inputs for which 
the contribution to variance is supported may be greater than 1. 
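
The estimator can be sketched directly from the definitions. In this illustration the data sets are lists of rows, the model is a function of one row, and the function name is hypothetical.

```python
def total_contribution(X, X_prime, model, j):
    """Estimate the total contribution to variance of input column j."""
    n = len(X)
    y = [model(row) for row in X]
    f0 = sum(y) / n
    d = sum(v * v for v in y) / n - f0 ** 2
    # Derived matrix: column j from X', all other columns from X.
    mixed = [list(r[:j]) + [rp[j]] + list(r[j + 1:])
             for r, rp in zip(X, X_prime)]
    y_mixed = [model(row) for row in mixed]
    d_rest = sum(a * b for a, b in zip(y, y_mixed)) / n - f0 ** 2
    return (d - d_rest) / d
```

For an additive model y = x₀ + 2·x₁ with independent uniform inputs, the variance shares are 1/5 and 4/5, so with a large simulated sample the two estimates should land near 0.2 and 0.8.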

1. Introduction 


Since association rule mining (Agrawal and Srikant, 1994) was proposed, many algorithms have 
emerged and been successfully applied to real-world applications. In recent years, because of the 
importance and necessity of analyzing geospatial data in different industries, spatial data mining 
approaches have attracted considerable interest. Among the existing spatial data mining approaches, the spatial 
association rule mining proposed by Koperski and Han (1995) is one of the most typical approaches for 
spatial pattern discovery. 


As defined by Koperski and Han (1995), a spatial association rule is a rule that describes the implication of 
one or a set of spatial objects by another set of spatial objects in spatial databases. The spatial objects 
involved therefore can be classified into two groups: the event objects and the geo-context objects. 


• The event objects are the research targets of rule mining, which means that the rules discovered are 
about the spatial patterns of the event objects. 
• The geo-context objects are used to describe the patterns of the event objects. 


The patterns are represented by spatial relationships (e.g., topological relationships) defined between each 
pair of an event object and a geo-context object. 


An example of a spatial association rule, event and geo-context objects, and spatial relationships is given 
below. 


Example 1 
Rule: Most crime cases within census tract No. 1 are close to Freya St (street). 


This is a spatial association rule discovered in a spatial database containing crime cases and map elements. 
The crime cases are event objects. Census tract No. 1 and Freya St are specific geo-context objects. The 
whole set of geo-context objects may include all the census tracts, streets and roads, and other map 
elements in the database. Within and close to are spatial relationships defined between the crime cases and 
the census tract and the road. 


The spatial relationships can be symbolically represented by the spatial predicates of event objects. 
Definition 1: Spatial Predicate 

A spatial predicate can be seen as a spatial attribute of an event object. It is defined by an ordered 2-tuple 
<r, o>, where r is a spatial relationship, o is a geo-context object. 


By using the spatial predicates, the rule in Example 1 can be written as: 


<Within, Tract1> ⇒ <Close to, Freya St> (a%, l%) (1) 

where <Within, Tract1> is the condition of the rule, and <Close to, Freya St> is the prediction of the rule, 
both of which are spatial predicates. In parentheses, a% denotes condition support, that is, the percentage of 
crime cases that satisfy the condition of the rule, <Within, Tract1>. The symbol l% denotes the 
rule lift, the ratio of the confidence of the rule (the probability that the prediction is true given that 
the condition is true) to the prior probability of the prediction. Lift therefore measures 
the gain in prediction accuracy obtained by using the rule. Rules without sufficiently large a% and l% are not 
regarded as significant. Besides condition support and lift, other statistics control rule 
generation; they are explained in detail in section 2.5. 
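
As a worked illustration of condition support, confidence and lift, the sketch below evaluates rule (1) on three sample transactions (the same three crime cases that appear in the sample transaction table of section 2.2). The function name and the representation of predicates as tuples are assumptions of the sketch.

```python
def rule_measures(transactions, condition, prediction):
    """Condition support, rule support, confidence and lift of a rule."""
    n = len(transactions)
    n_cond = sum(1 for t in transactions if condition <= t)
    n_both = sum(1 for t in transactions if (condition | prediction) <= t)
    n_pred = sum(1 for t in transactions if prediction <= t)
    confidence = n_both / n_cond
    return {
        "condition_support": n_cond / n,
        "rule_support": n_both / n,
        "confidence": confidence,
        "lift": confidence / (n_pred / n),  # confidence over prior probability
    }


transactions = [
    {("Within", "Tract1"), ("Close to", "Freya St"), ("Close to", "Wellesley Av")},
    {("Within", "Tract2")},
    {("Within", "Tract1"), ("Close to", "Freya St")},
]
measures = rule_measures(transactions,
                         {("Within", "Tract1")},
                         {("Close to", "Freya St")})
```

Here two of the three cases satisfy the condition and both of them are close to Freya St, so the confidence is 1 and the lift is 1 / (2/3) = 1.5.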


A spatial association rule describes the spatial distribution pattern of a set of event objects by their spatial 
relationships with the geo-context objects. However, as analyzed by Dong et al. (2012), existing spatial 
association rule mining approaches have a major limitation that they cannot effectively involve all available 
non-spatial information of the spatial objects. As a result, many interesting rules expressing richer 
information (e.g., the combinations of spatial and non-spatial information) cannot be found even if non- 
spatial information that could be useful for rule discovery is available. 


This document describes the Generalized Spatial Association Rule (GSAR) mining algorithm, which 
remedies this significant shortcoming. GSAR is capable of exploiting all available information of the 
spatial objects, including spatial and non-spatial information. 


2. Generalized Spatial Association Rule (GSAR) 


GSAR combines and extends the merits of the traditional spatial association rule (Koperski and Han, 1995) 
and the generalized association rule (Srikant and Agrawal, 1995, Han and Fu, 1995) so that much more 
information can be involved in analysis than ever before. 


The overall flow of the Generalized Spatial Association Rule (GSAR) mining algorithm includes the 
following steps. 


Step 1: User specifies event objects and geo-context objects, as well as spatial predicates and criteria for 
mining association rules. 


Step 2: Compute spatial relationships between event objects and geo-context objects, and construct spatial 
predicate transaction table. 


Step 3: Involve non-spatial attributes of event objects, if provided. 


Step 4: Involve non-spatial attributes of geo-context objects, if provided. 
Step 5: Rule mining, and the output will be GSAR. 


Each of the steps is described in the following subsections. We use the crime analysis in Example 1 as a 
sample scenario to explain each step. 


2.1. Initialization 


First, the user needs to specify which spatial objects in the inputs are event objects, and which are geo- 
context objects. Suppose we are given a set of crime history, where each crime case has latitude and 
longitude coordinates, and multiple map layers are available. To analyze the crime patterns using the map 
layers, a user can specify the crime cases as event objects, and all the census tracts (may appear as polygons 
in the database), streets and roads (may appear as polylines) as geo-context objects. Usually, some 
attributes of all these spatial objects are also available (e.g., the type of crime, the population density of 
census tracts, etc.). Such information is important for discovering interesting patterns. Typically, an 
attribute table is associated with each layer of spatial objects. 


2.2. Construction of Spatial Predicate Transaction Table 


To mine spatial association rules, a spatial predicate transaction table of event objects needs to be 
constructed. Each row of the transaction table contains the spatial predicates of one event object. Let us 
consider the crime analysis scenario in Example 1. A sample spatial predicate transaction table of the crime 
cases is given in Table 1, where the ID column represents crime case identifier. The “close to” relationship 
can be defined by a condition such as a distance less than 500 feet. The spatial relationships “within” and 
“close to” used here are two typical topological relationships. Other types of spatial relationships, such as 
directional relationships, can also be used. 


Table 1: A Sample Spatial Predicate Transaction Table 

ID | Spatial predicates 
1 | <Within, Tract1>, <Close to, Freya St>, <Close to, Wellesley Av> 
2 | <Within, Tract2> 
3 | <Within, Tract1>, <Close to, Freya St> 


By treating spatial predicates as items¹, a traditional association rule mining algorithm, such as Apriori, can 
be applied to the spatial predicate transaction table, producing spatial patterns of crimes as rules. 
Nonetheless, the spatial predicates only express the spatial attributes of event objects. Non-spatial attributes 
of event and geo-context objects, which can also be useful for finding interesting rules, are not included 
in the transaction table. Involving non-spatial attributes of event objects is straightforward and is 
done in the next step. 


2.3. Involving Non-Spatial Attributes of Event objects 


In some cases where non-spatial attributes of event objects are not available, or the user does not want to 
involve them in analysis, this step can be omitted. Otherwise, the information can be involved by 
expanding the transaction table, joining the available non-spatial attributes² of event objects by their unique 
identifiers. Suppose each crime case has two non-spatial attributes, crime type and day of week on which it 
was reported. After joining, the resulting expanded transaction table will look like Table 2. 

¹ Items can be flag-type conditions that indicate the presence or absence of a particular thing in a specific 
transaction or simply categories of categorical variables. 

² The non-spatial information is either categorical, or discretized according to user-specified cutpoints, or 
discretized automatically through equal-width binning. 


Table 2: Spatial Predicate Transaction Table Expanded from Table 1 

ID | Crime type | Day of week | Spatial predicates 
1 | Drugs | Tuesday | <Within, Tract1>, <Close to, Freya St>, <Close to, Wellesley Av> 
2 | Robbery | Friday | <Within, Tract2> 
3 | Vehicle Theft | Monday | <Within, Tract1>, <Close to, Freya St> 


2.4. Involving Non-Spatial Attributes of Geo-Context Objects 


Available non-spatial attributes of geo-context objects are also important for finding interesting patterns. 
However, if such non-spatial attributes are involved (as above) and a traditional association rule mining 
algorithm is applied, a large number of redundant patterns that provide nothing interesting can be 
generated. For instance, suppose Freya St has a non-spatial attribute Load=Heavy. If we simply append this 
attribute to Table 2, treat it as a usual item, and run Apriori, the following itemset³ of length two can be 
found as frequent: 


{<Close to, Freya St>, <Close to, Load=Heavy>} 


This itemset correlates Freya St with its attribute Load=Heavy. Nonetheless, it provides nothing new to the 
known information and therefore is redundant. Any rule generated from it thus is also meaningless. Ideally, 
such patterns should be prevented from pattern generation. However, traditional association rule mining 
treats items equally and independently. Therefore, in the above case, Freya St and Load=Heavy cannot be 
prevented from appearing within the same itemset or rule. 


GSAR supports user-specified pairs of fields to exclude, so that results can be more relevant and 
interesting. After removing those pairs, generalized spatial predicates are inferred from spatial predicates. 
For example, as given in Table 3, <Within, POPDEN=Low> and <Within, RMF=Avg> are generalized 
spatial predicates of <Within, Tract1>, and <Close to, Road> is a generalized spatial predicate of <Close to, 
Freya St>. We can find that a non-spatial attribute (or a concept) of a geo-context object can only appear in 
a generalized spatial predicate by replacing the corresponding geo-context object. 


Table 3: Transaction Table Further Expanded from Table 2 with Generalized Spatial Predicates 

ID | Crime type | Day of week | Spatial predicates | Generalized spatial predicates 
1 | Drugs | Tuesday | <Within, Tract1>, <Close to, Freya St>, <Close to, Wellesley Av> | <Within, POPDEN=Low>, <Within, RMF=Avg>, <Close to, Road> 
2 | Robbery | Friday | <Within, Tract2> | <Within, POPDEN=VeryHigh>, <Within, RMF=Avg> 
3 | Vehicle Theft | Monday | <Within, Tract1>, <Close to, Freya St> | <Within, POPDEN=Low>, <Within, RMF=Avg>, <Close to, Road> 
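
The inference of generalized spatial predicates can be sketched as a simple substitution: each geo-context object in a spatial predicate is replaced by each of its non-spatial attributes or concepts. The function name and the attribute dictionary below are illustrative assumptions built from the values shown in Table 3.

```python
def add_generalized_predicates(predicates, attributes):
    """Return the predicates followed by their generalized counterparts.

    predicates: list of (relationship, geo_context_object) tuples.
    attributes: dict mapping a geo-context object to its attributes/concepts.
    """
    result = list(predicates)
    for relationship, obj in predicates:
        for concept in attributes.get(obj, []):
            generalized = (relationship, concept)
            if generalized not in result:   # keep each predicate once
                result.append(generalized)
    return result


# Hypothetical attribute table for the geo-context objects of the example.
ATTRIBUTES = {
    "Tract1": ["POPDEN=Low", "RMF=Avg"],
    "Tract2": ["POPDEN=VeryHigh", "RMF=Avg"],
    "Freya St": ["Road"],
    "Wellesley Av": ["Road"],
}
```

Applied to the first crime case, the two streets both generalize to <Close to, Road>, which therefore appears only once in the expanded row.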


³ An itemset is a group of items which may or may not tend to co-occur within transactions. 


2.5. Rule Mining 


All the above steps can be regarded as data preparation for Generalized Spatial Association Rules (GSAR) 
mining. Now we give a definition of GSAR. A Generalized Spatial Association Rule (GSAR) extends the 
traditional spatial association rule so that it can contain (1) spatial relationships, (2) non-spatial attributes of 
event objects, and (3) non-spatial attributes of geo-context objects. 


As can be seen, in the GSAR mining algorithm, point (3) above is expressed via the generalized spatial 
predicates. As described in previous sections, the expanded spatial predicate transaction table which 
contains non-spatial attributes of event objects (denoted by T) is further expanded with generalized spatial 
predicates and becomes T’. Taking this as input, frequent itemsets are generated from T’, and pruning is 
applied to itemsets of length two. Then rules are generated. Details are summarized in Algorithm 1. 


Algorithm 1: Finding Generalized Spatial Association Rules (GSAR) 


Input: An expanded spatial predicate transaction table T’; user input excluded field pairs S; minimum 
condition support $a_{min}$; minimum lift $l_{min}$; minimum rule support $s_{min}$; minimum confidence $c_{min}$; maximum 
rule length L. 


Output: A set of rules 


// Rule mining with redundant pattern pruning 
1) Start from iteration k=1. Each item is a candidate itemset at this level. 


2) Treat all the elements in T’ as ordinary items, and scan all candidate itemsets of size k to see if there 
are any with frequency exceeding a predetermined threshold of minimum rule support $s_{min}$. If yes, 
continue; otherwise, the iteration ends. 


3) Find all candidate itemsets of length k+1 (for k=1 the itemsets covered by user input excluded field 
pairs S are pruned). If such candidate itemsets are found, continue; otherwise, the iteration ends. 


4) From candidate itemsets, find frequent itemsets with support above or equal to $s_{min}$. 


5) Based on the frequent itemsets, generate association rules with lift values above or equal to $l_{min}$, 
condition support above or equal to $a_{min}$, and confidence values above or equal to minimum 
confidence $c_{min}$. 


6) Increase k by one. If k < L then go to step 2); otherwise, the iteration ends. 


Note that GSAR reduces to plain Apriori if geo-context objects or pairs of fields to exclude are not given. 
More details of the GSAR algorithm, as well as its time complexity analysis, can be found in Dong et al. 
(2012). 
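
The overall flow of Algorithm 1 can be sketched as a level-wise search with pair pruning. This is a simplified illustration and not the implementation described by Dong et al. (2012): the function name is hypothetical, rules are restricted to a single-item prediction, candidate itemsets are kept only if all of their sub-itemsets are frequent (the usual Apriori check), and `max_len` is interpreted as the largest itemset size.

```python
from itertools import combinations


def gsar_sketch(transactions, excluded_pairs, a_min, l_min, s_min, c_min, max_len):
    """Return rules as (condition, prediction, cond_support, confidence, lift)."""
    tsets = [frozenset(t) for t in transactions]
    n = len(tsets)

    def support(items):
        return sum(1 for t in tsets if items <= t) / n

    excluded = {frozenset(p) for p in excluded_pairs}
    singles = {frozenset([i]) for t in tsets for i in t}
    level = {c for c in singles if support(c) >= s_min}
    frequent = []
    k = 1
    while level and k <= max_len:
        frequent.extend(level)
        # Candidates of size k+1 whose k-subsets are all frequent.
        candidates = set()
        for a, b in combinations(level, 2):
            u = a | b
            if len(u) == k + 1 and all(frozenset(s) in level
                                       for s in combinations(u, k)):
                candidates.add(u)
        if k == 1:
            candidates -= excluded      # prune user-excluded pairs
        level = {c for c in candidates if support(c) >= s_min}
        k += 1

    rules = []
    for itemset in frequent:
        if len(itemset) < 2:
            continue
        rule_support = support(itemset)
        for prediction in itemset:
            condition = itemset - {prediction}
            cond_support = support(condition)
            confidence = rule_support / cond_support
            lift = confidence / support(frozenset([prediction]))
            if cond_support >= a_min and confidence >= c_min and lift >= l_min:
                rules.append((condition, prediction, cond_support, confidence, lift))
    return rules
```

Because the excluded pair is removed at length two, no longer itemset containing that pair survives the subset check, so a street and its own attribute never appear together in a rule.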


2.6. Evaluation Measures 


The different measures emphasize different aspects of the rules. 

Name | Range | Comments 

Condition support | [0,100%] | Proportion of input transactions that contain the condition. 

Rule Support | [0,100%] | Proportion of input transactions that contain the entire rule: conditions, and prediction. This indicates what percentage of the prediction in the inputs can be predicted through the condition. 

Confidence | [0,100%] | Ratio of rule support to condition support. This is the probability of having the prediction in the condition population. 

Lift | [1.0, infinity) | Ratio of confidence for the rule to the prior probability of having the prediction. For example, if 10% of the entire population buys bread, then a rule that predicts whether people will buy bread with 20% confidence will have a lift of 20/10 = 2. If another rule tells you that people will buy bread with 11% confidence, then the rule has a lift of close to 1, meaning that having the condition does not make a lot of difference in the probability of having the prediction. In general, rules with lift different from 1 will be more interesting than rules with lift close to 1. The GSAR component requires minimum lift to be no less than 1. 

Deployability | [0,100%] | The percentage of the transactions that contain the condition but do not contain the prediction. In product purchase terms, it basically means what percentage of the total customer base owns (or has purchased) the condition but has not yet purchased the prediction. 

3. Appendix 


3.1. Detection of most interesting rules 


Denote $x_i^{(m)}$ as the value of rule i (i = 1, ..., I) along rule measurement dimension m (m = 
1, ..., M). The rule i is considered “interesting” along dimension m if 

$$x_i^{(m)} > \bar{X}^{(m)} + T \cdot SD^{(m)}$$

where T is a constant, by default set to 3. $\bar{X}^{(m)}$ and $SD^{(m)}$ are the mean and standard deviation for 
dimension m, respectively. They are computed using the following steps: 


1. Start with: $W_0^{(m)} = \bar{X}_0^{(m)} = 0$, 

2. Compute the statistics below for i = 1 to I. (Skip to the next data record when $x_i^{(m)}$ is missing): 

$$W_i^{(m)} = W_{i-1}^{(m)} + 1,$$

$$v_i^{(m)} = \frac{1}{W_i^{(m)}}\left(x_i^{(m)} - \bar{X}_{i-1}^{(m)}\right)$$

$$\bar{X}_i^{(m)} = \bar{X}_{i-1}^{(m)} + v_i^{(m)}$$

3. After the last rule i = I has been processed, return the following result 

$$\bar{X}^{(m)} = \bar{X}_I^{(m)}$$


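The effect size of interestingness $Q_i^{(m)}$ is evaluated by 

$$Q_i^{(m)} = \begin{cases} 0, & \text{if } x_i^{(m)} \le \bar{X}^{(m)} + T \cdot SD^{(m)} \\ \dfrac{x_i^{(m)} - \bar{X}^{(m)}}{SD^{(m)}}, & \text{if } x_i^{(m)} > \bar{X}^{(m)} + T \cdot SD^{(m)} \end{cases}$$

A rule with greater $Q_i^{(m)}$ is more interesting. 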
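
The detection of interesting rules along one measurement dimension can be sketched as below. The running mean follows the one-pass update described above; the standard deviation update is not legible in the source, so this sketch assumes a Welford-style sum of squared deviations and the sample (n − 1) standard deviation. The function name is illustrative, and missing values are represented by `None`.

```python
import math


def interestingness(values, t=3.0):
    """Return (mean, sd, q) where q[i] is the effect size of rule i."""
    w, mean, m2 = 0, 0.0, 0.0
    for x in values:
        if x is None:               # skip missing values
            continue
        w += 1
        v = (x - mean) / w          # v_i = (x_i - previous mean) / W_i
        m2 += (x - mean) * (x - (mean + v))   # assumed Welford update
        mean += v
    sd = math.sqrt(m2 / (w - 1))    # assumed sample standard deviation
    threshold = mean + t * sd
    q = [0.0 if (x is None or x <= threshold) else (x - mean) / sd
         for x in values]
    return mean, sd, q
```

With twenty rules scoring 1 and one rule scoring 10 on some measure, only the last rule exceeds the mean by more than three standard deviations and receives a nonzero effect size.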


References 


[1]. Agrawal, R. and Srikant, R. (1994), “Fast algorithms for mining association rules,” in VLDB, 
1994, pp. 487-499. 


[2]. Dong, W., Li, L., Zhou, C., Wang, Y., Li, M., Tian, C., and Sun, W. (2012), “Discovery of 
generalized spatial association rules,” in IEEE International Conference on Service Operations 
and Logistics, and Informatics (SOLI), 2012, pp. 60-65. 


[3]. Han, J. and Fu, Y. (1995) “Discovery of multiple-level association rules from large databases,” in 
VLDB, 1995, pp. 420-431. 


[4]. Koperski, K. and Han, J. (1995), “Discovery of spatial association rules in geographic information 
databases,” in Advances in spatial databases, 1995, pp. 47-66. 


[5]. Srikant, R. and Agrawal, R. (1995), “Mining generalized association rules,” in VLDB, 1995, pp. 
407-419. 


SPATIAL ASSOCIATION RULES Algorithms 


SPATIAL TEMPORAL PREDICTION Algorithms 


1. Introduction 


1.1 Background 


Spatio-temporal statistical analysis has many applications, for example energy management
for buildings or facilities, performance analysis and forecasting for service branches, and
public transport planning. In these applications, measurements such as energy usage are often
taken over space and time. The key questions are which factors will affect future
observations and what can be done to effect a desired change or to better manage the system.
To address these questions, we need statistical techniques that can forecast future values at
different locations and that explicitly model adjustable factors so that what-if analyses can
be performed.


However, these analytical needs are not the focus of traditional spatio-temporal statistical 
research. In traditional statistical research, spatio-temporal analysis is treated just as an 
extension of spatial analysis and focuses more on looking for patterns in past data rather than 
forecasting future values. The traditional spatio-temporal research targets different 
application areas such as environmental research. There are, however, different types of 
spatio-temporal problems in which time is the key component. We therefore need to treat 
spatio-temporal analysis as a unique type of problem itself, not an extension to spatial 
analysis. Moreover, we need to explicitly model these factors to allow for what-if analysis. 
Although these kinds of problems could be addressed by traditional methods, the emphasis is 
quite different. 


This algorithm assumes a fixed set of spatial locations (either point location or center of an 
area) and equally spaced time stamps common across locations. It can issue predicted or 
interpolated values at locations with no response measurements (but with available 
covariates). We call our model spatio-temporal prediction (STP). 


The goal of the STP algorithm is to address the needs for solving the spatio-temporal 
problems. STP can generate predictions at any location within a 3D space for any future time. 
It also explicitly models the external factors so we can perform what-if analysis. 


1.2 Handling of missing data 


The algorithm is designed to accommodate missing values in the response variable, as well as 
in the predictors. We consider an observation at a given time point and location ‘complete’ if 
all predictors and the response are observed at that time and location. To allow for model 
fitting in spite of missing data, all of the following conditions must be met: 


1. At each location, observations need to be complete for at least one sequence of at least
L + 2 consecutive time points.

2. For any pair of locations s_i, s_j, s_i \ne s_j, observations must be complete at both
locations simultaneously for at least two sequences of L + 2 consecutive time points.

3. Overall, at least L sequences of at least L + 2 consecutive time points must be
present in the data (to allow for estimation of \alpha).

4. The total number of complete samples must be at least equal to D + L + 2, where D
is the number of predictors, including the intercept, and L the user-specified lag.

5. After removing locations according to the rules above, no more than 5% of the
remaining records should be incomplete. As an example, if after removing locations,
n locations and m time stamps remain, no more than n × m × 0.05 records should be
incomplete.


The above conditions should be verified in the following order: 


Step 1. Remove locations that do not meet condition 1. 


Step 2. Remove locations that violate condition 2 in the following order: 


(a) Let J be the set of points that violate condition 2. 


(b) Eliminate from the data set the observation(s) that violate condition 2 for the 
greatest number of pairs. In case of a tie, remove all observations that are tied. 


(c) Update J by removing any observations that no longer violate condition 2.
That is, remove observations that violated condition 2 only in a pair with the
observations that were removed in Step 2b.

(d) Iterate Steps 2b and 2c until J is empty.


Step 3. If after Steps 1 and 2, conditions 3-5 are violated, the model cannot be fit. 
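
As an illustration of Step 1, the following sketch checks condition 1 for each location. The data layout (a dictionary mapping a location name to a list of per-time-point completeness flags) and both function names are assumptions made for this example only.

```python
def longest_complete_run(complete):
    """Length of the longest run of consecutive complete time points."""
    best = run = 0
    for ok in complete:
        run = run + 1 if ok else 0
        best = max(best, run)
    return best

def locations_meeting_condition_1(complete_by_location, L):
    """Keep locations with at least one run of L + 2 complete time points."""
    return [
        loc
        for loc, flags in complete_by_location.items()
        if longest_complete_run(flags) >= L + 2
    ]

# Hypothetical completeness flags for two locations, with lag L = 1.
flags = {
    "a": [True, True, True, False, True],
    "b": [True, False, True, False, True],
}
kept = locations_meeting_condition_1(flags, L=1)
```

Location "a" has a run of three complete time points and is kept; location "b" never has more than one in a row and would be removed.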


2 Model 


2.1 Notation 


The following notation is used for the model inputs: 


Name                                          Symbol                                          Type     Dimensions
Number of time stamps                         m > L                                           integer  1
Number of measurement locations               n >= 3                                          integer  1
Number of prediction grid points              N                                               integer  1
Number of predictors (including intercept)    D                                               integer  1
Index of time stamps                          t \in {1, ..., m}                               integer  1
Spatial coordinates                           s \in {s_1, ..., s_n}, s_i = (u_i, v_i, w_i)'   vector   3 × 1
Targets observed at location s and time t     Y_t(s)                                          scalar   1
Targets observed at location s                Y(s)                                            vector   m × 1
Targets observed at time t                    Y_t                                             vector   n × 1
Predictors observed at location s and time t  X_t(s) = (X_{t1}(s), ..., X_{tD}(s))'           vector   D × 1
Predictors observed at location s             X(s) = (X_1(s), ..., X_m(s))'                   matrix   m × D
Predictors observed at time t                 X_t = (X_t(s_1), ..., X_t(s_n))'                matrix   n × D
Maximum autoregressive time lag               L > 0                                           integer  1
Length of prediction steps                    H > 0                                           integer  1
Notes

i. For a predictor that does not vary over space, X_{td}(s_1) = X_{td}(s_2) = ... = X_{td}(s_n).

ii. For a predictor that does not evolve over time, X_{1d}(s) = X_{2d}(s) = ... = X_{md}(s).

The following notation is used for model definition and computation:


Name                                  Symbol                                 Type      Dimension
Coefficient vector for linear model   \beta = (\beta_1, ..., \beta_D)'       vector    D
Coefficient vector for AR model       \alpha = (\alpha_1, ..., \alpha_L)'    vector    L
Vector of 1's                         1 = (1, ..., 1)'                       vector    variable
Kronecker product                     \otimes                                operator  NA


2.2 Model structure


    Y_t(s) = \sum_{d=1}^{D} \beta_d X_{td}(s) + Z_t(s)    (1)

where Z_t(s) is a mean-zero space-time correlated random process. Users can specify whether
an “intercept” term needs to be included in the model. The inference algorithm works with
general “continuous” variables, and with or without an intercept.


• Autoregressive model, AR(L), for time autocorrelation (Brockwell and Davis, 2002):

    Z_t(s) = \sum_{l=1}^{L} \alpha_l Z_{t-l}(s) + \epsilon_t(s)    (2)

Note that users need to specify the maximum AR lag L.

Let \epsilon_t = (\epsilon_t(s_1), ..., \epsilon_t(s_n))' be the AR residual vector at time t. Since the time
autocorrelation effect has already been removed, \epsilon_{L+1}, ..., \epsilon_m are independent.


• Parametric or nonparametric covariance model for spatial dependence:

    V(\epsilon_t) = \Sigma_S, t = L + 1, ..., m    (3)

where \Sigma_S = {R(s_i, s_j)}_{i,j=1,...,n} is an n × n covariance matrix of spatial covariance functions
R(s, s') = Cov(Y_t(s), Y_t(s')) at observed locations. Two alternative ways of modeling the
spatial covariance function R(s_i, s_j) are implemented: a variogram-based parametric model
(Cressie, 1993) and an Empirical Orthogonal Functions (EOF)-based nonparametric model
(Cohen and Jones, 1969; Creutin and Obled, 1982).


Note that users can specify which covariance model is to be used.

• If a “parametric model” is chosen, the algorithm automatically tests for goodness of
fit. If the test suggests that a parametric model is not adequate, the algorithm
switches to EOF model fitting and issues predictions based on the EOF model.

• If an EOF model is chosen, the switching test is skipped, and both model fitting
and prediction follow the EOF-based algorithm.


Under this model decomposition, the covariance structure for the spatio-temporal process
Y = (Y_{L+1}', ..., Y_m')' is of separable form

    V(Y) = V(Z) = \Sigma = \Sigma_T \otimes \Sigma_S    (4)

where \Sigma_T is the temporal covariance matrix built from the autocovariance function.
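
To make the separable form concrete, the sketch below builds a Kronecker product of a small temporal and a small spatial covariance matrix. The helper `kron` and the two 2 × 2 matrices are illustrative assumptions, not values taken from the algorithm.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [x * y for x in row_a for y in row_b]
        for row_a in a
        for row_b in b
    ]

# Hypothetical temporal and spatial covariance matrices.
sigma_t = [[1.0, 0.5],
           [0.5, 1.0]]
sigma_s = [[2.0, 1.0],
           [1.0, 2.0]]

# 4 x 4 covariance of the stacked vector (two time points, two locations).
sigma = kron(sigma_t, sigma_s)
```

Each 2 × 2 block of `sigma` is the spatial matrix scaled by one entry of the temporal matrix, which is what separability means here.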


3 Estimation algorithm 


This section provides details on the multi-step procedure to fit the STP model (see Figure 1)
when the user specifies a “parametric model”. If an “empirical model” is specified, the
switching test is skipped, and both model fitting and prediction follow the EOF-based
algorithm.


[Flowchart: OLS fit; AR fit to OLS residuals; estimate empirical spatial covariance matrix;
estimate parametric spatial covariance model; goodness-of-fit tests; then either calculate the
spatial covariance matrix based on the parametric model or use the empirical spatial covariance
matrix; refit AR model; fit GLS model; display and save results for prediction.]


Figure 1. Flowchart of algorithm steps for model fitting when a “parametric model” is specified. 


Step 1: Fit regression model by ordinary least squares (OLS) regression using only 
observations that have no missing values (see Section 3.1). 


We first ignore the spatio-temporal dependence in the data and simply estimate the
fixed regression part by OLS and obtain the regression residuals \hat{Z}_t(s).

Step 2: Fit autoregressive model using only data without missing values (see Section 3.2).

Ignoring spatial dependence in the OLS residuals \hat{Z}_t(s), we estimate the autoregressive
coefficients by fitting the regression model (2) and obtain the AR residuals \hat{\epsilon}_t(s).


Step 3: Fit spatial covariance model and test for goodness of fit on data without missing 
values (see Section 3.3). 


We fit a parametric spatial covariance model. We perform two Goodness of Fit tests 
to decide whether to continue with the parametric covariance model or the empirical 
covariance matrix. 


Step 4: Refit autoregressive model using augmented data (see Section 3.4).

We refit the autoregressive model, accounting for spatial dependence, by generalized least
squares (GLS) and obtain improved AR coefficients \hat{\alpha}.

Step 5: Refit regression model using augmented data (see Section 3.5).

We obtain improved regression coefficients \hat{\beta} by GLS to account for
spatio-temporal correlation in the data.


Step 6: Save the results for use in output and prediction. 


3.1 Fit regression model 


We first ignore the spatio-temporal dependence in the data and simply estimate the fixed
regression part by OLS. Assume that out of nm location-time combinations, q samples have
missing values in either X or Y. Let Y = (Y_1', ..., Y_m')', an (nm - q) × 1 vector, and
X = (X_1', ..., X_m')', an (nm - q) × D matrix, such that X and Y contain only complete
observations, i.e., observations without any missing values. The OLS estimates of the regression
coefficients are:

    \hat{\beta} = (X'X)^{-1} X'Y    (5)

The residuals are:

    \hat{Z} = Y - X\hat{\beta}.    (6)
3.2 Fit autoregressive model 


We estimate the autoregressive coefficients by OLS, assuming no spatial correlation and AR(L)
as the model for time-series autocorrelation,

    \hat{Z}_t = \alpha_1 \hat{Z}_{t-1} + ... + \alpha_L \hat{Z}_{t-L} + \epsilon_t,    (7)

where \hat{Z}_t is an n_t × 1 vector. Note that due to missing values, the number of
locations n_t varies among time points. Moreover, for each time point t, only
locations with no missing values at L + 1 consecutive time points, i.e., (t, t - 1, ..., t - L),
can be used for model fitting; therefore, \sum_{t=L+1}^{m} n_t \le n(m - L) - q.

Step 1: Construct the n_t × L time lag matrix

    \hat{Z}_{t-lag} = (\hat{Z}_{t-1}, ..., \hat{Z}_{t-L}).    (8)

Step 2: Let \hat{Z}_{lag} = (\hat{Z}_{L+1-lag}', ..., \hat{Z}_{m-lag}')' and \hat{Z}^* = (\hat{Z}_{L+1}', ..., \hat{Z}_m')'. Solve the linear
system

    (\hat{Z}_{lag}' \hat{Z}_{lag}) \hat{\alpha} = \hat{Z}_{lag}' \hat{Z}^*    (9)

which is equivalent to solving

    (\sum_{t=L+1}^{m} \hat{Z}_{t-lag}' \hat{Z}_{t-lag}) \hat{\alpha} = \sum_{t=L+1}^{m} \hat{Z}_{t-lag}' \hat{Z}_t    (10)

using the sweep operation to find the estimate \hat{\alpha}.

Step 3: Compute the de-autocorrelated AR(L) residuals

    \hat{\epsilon}_t = \hat{Z}_t - \hat{\alpha}_1 \hat{Z}_{t-1} - ... - \hat{\alpha}_L \hat{Z}_{t-L}, t = L + 1, ..., m    (11)
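
The three steps above can be sketched as follows. This is a minimal illustration, not the procedure's implementation: it assumes one residual series per location with `None` for missing values, uses plain Gauss-Jordan elimination in place of the sweep operation, and assumes the normal equations are non-singular. The names `solve` and `fit_ar` are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def fit_ar(series_by_location, L):
    """Pooled OLS fit of AR(L) coefficients; returns (alpha, residuals)."""
    xtx = [[0.0] * L for _ in range(L)]
    xty = [0.0] * L
    rows = []
    for z in series_by_location:
        for t in range(L, len(z)):
            window = z[t - L:t + 1]
            if any(v is None for v in window):
                continue  # need L + 1 consecutive non-missing values
            lags = window[-2::-1]  # z[t-1], ..., z[t-L]
            for i in range(L):
                xty[i] += lags[i] * z[t]
                for j in range(L):
                    xtx[i][j] += lags[i] * lags[j]
            rows.append((lags, z[t]))
    alpha = solve(xtx, xty)
    residuals = [y - sum(a * x for a, x in zip(alpha, lags)) for lags, y in rows]
    return alpha, residuals

# Two hypothetical locations; the second has one missing value.
alpha, eps = fit_ar([[8.0, 4.0, 2.0, 1.0, 0.5],
                     [4.0, 2.0, None, 1.0, 0.5]], L=1)
```

Both series halve at every step, so the fitted AR(1) coefficient is 0.5 and the residuals vanish; the rows touching the missing value are skipped.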
3.3 Fit model and check goodness of fit for spatial covariance structure


We explicitly model the spatial covariance structure among locations, rather than using 
variogram estimation. 


Under the assumption of the model (stationarity, AR-relationship removed), the mean of the 
residuals is 0 at all locations. We therefore estimate the unadjusted empirical covariances sjj 
and correlations rj; assuming mean 0, i.e., 


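    S = [s_{ij}]_{i,j=1,...,n},  s_{ij} = (1 / t_{ij}) \sum_t \hat{\epsilon}_t(s_i) \hat{\epsilon}_t(s_j)    (12)

where t_{ij} is the number of complete residual pairs between locations s_i and s_j, and t
indexes these pairs, i.e., the time points for which both \hat{\epsilon}_t(s_i) and \hat{\epsilon}_t(s_j) are non-missing.

    r_{ij} = s_{ij} / \sqrt{s_{ii} s_{jj}}    (13)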
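
A small sketch of the mean-zero empirical covariances and correlations with pairwise-complete residuals follows. The input layout (`residuals[i][t]`, with `None` for a missing residual) and the function name are assumptions for illustration; the sketch also assumes every location has a nonzero variance.

```python
import math

def empirical_covariance(residuals):
    """Return (S, R, counts) from AR residuals, assuming mean zero.

    residuals[i][t] is the residual at location i and time t, or None.
    counts[i][j] is the number of complete pairs t_ij.
    """
    n = len(residuals)
    s = [[0.0] * n for _ in range(n)]
    counts = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            pairs = [
                (a, b)
                for a, b in zip(residuals[i], residuals[j])
                if a is not None and b is not None
            ]
            counts[i][j] = len(pairs)
            if pairs:
                s[i][j] = sum(a * b for a, b in pairs) / len(pairs)
    r = [
        [s[i][j] / math.sqrt(s[i][i] * s[j][j]) for j in range(n)]
        for i in range(n)
    ]
    return s, r, counts

# Two hypothetical locations; one residual is missing at the second location.
S, R, T_counts = empirical_covariance([[1.0, -1.0, 1.0, -1.0],
                                       [2.0, -2.0, None, -2.0]])
```

Here the cross-covariance uses only the three time points where both residuals are present.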


To determine whether to model the spatial covariance structure parametrically or to use the 
nonparametric EOF model, we perform the following two tests sequentially: 


1. Fit a parametric model to the covariances using the parameter vector \psi = (\sigma^2, \theta, \tau^2)'
(Cressie, 1993):

    Cov(\epsilon_t(s_i), \epsilon_t(s_j); \psi) = \sigma^2 \exp(-(h_{ij}/\theta)^p),  if h_{ij} > 0;
                                                 = \sigma^2 + \tau^2,                  otherwise,    (14)

where h_{ij} = ||s_i - s_j||_2 is the Euclidean distance between locations s_i and s_j. Users
need to specify the value of the order parameter p.

p \in [1, 2] is a user-defined parameter that determines the class of covariance models
to be fit. p = 1 corresponds to an exponential covariance model, p = 2 results in a
Gaussian covariance model, and p \in (1, 2) belongs to the powered exponential
family.

Next, determine whether there is a significant decay over space by testing H_0: -1/\theta^p = 0.
If we fail to reject H_0, we conclude that the decay over space is not significant, and
EOF estimation will be used. If EOF estimation is used, there is no need to calculate
\theta, \sigma^2, or \tau^2, as we have concluded that they are invalid descriptions of the covariance
matrix. In fact, there may not be valid solutions for these parameters; therefore, they
should not be estimated.

2. If the previous test rejects H_0, test for homogeneity of variances among locations: if
homogeneity of variances is rejected, EOF estimation will be used. Otherwise, the
parametric covariance model will be used.
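
For illustration, the parametric covariance in (14) can be evaluated for a set of locations as sketched below. The function name and the example parameter values are assumptions; `math.dist` (Python 3.8 or later) supplies the Euclidean distance.

```python
import math

def parametric_covariance(locations, sigma2, theta, tau2, p=1.0):
    """Powered exponential covariance matrix for the given coordinates."""
    n = len(locations)
    cov = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            h = math.dist(locations[i], locations[j])
            if h > 0:
                cov[i][j] = sigma2 * math.exp(-((h / theta) ** p))
            else:
                cov[i][j] = sigma2 + tau2  # zero distance: variance plus nugget
    return cov

# Two hypothetical 3D locations at distance 5, exponential model (p = 1).
cov = parametric_covariance([(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)],
                            sigma2=2.0, theta=5.0, tau2=0.5, p=1.0)
```

With these values the diagonal entries equal 2.5 and the off-diagonal entries equal 2 exp(-1).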


3.3.1 Fit and test parametric model 


a) Enforce a minimum correlation of +.01: if r_{ij} < .01, set s_{ij} = .01 \sqrt{s_{ii} s_{jj}} and r_{ij} = .01.


b) Let s be the vectorized lower triangular of the covariance matrix (excluding the diagonal,
i.e., excluding variances), r be the vectorized lower triangular of the correlation matrix
(excl. diagonal), and h the corresponding vector of pairwise distances between the n
locations. s, r and h are each vectors of length n(n - 1)/2.

c) Define \phi = -1/\theta^p. Fit the linear model ln s = ln \sigma^2 + \phi h^p using a GLS fit:

    A = [1, h^p]    (15)
    V^{-1} = T (B^{-1} - c b b') T    (16)

where b = 2r^2/(1 - r^2), r^2 is obtained by squaring each element of vector r, B^{-1} =
diag(b), and scalar c = 1/(1 + 1'B^{-1}1). Also, let T = diag[\sqrt{t_k}], k = 1, ..., n(n - 1)/2,
where t_k is the number of pairs of de-autocorrelated residuals in the calculation of the
corresponding element r_k in r, i.e., the number of observation pairs that went into
calculating r_k, which may be different for each entry of the covariance matrix, depending
on missing values. Note that t_k corresponds to the vectorized lower triangular of
[t_{ij}]_{i,j=1,...,n}, where t_{ij} are as defined in (12).


Let \eta = (ln \sigma^2, \phi)'. The GLS estimator can be calculated as

    \hat{\eta} = (A'V^{-1}A)^{-1} A'V^{-1} ln s

The standard error of \hat{\eta} is se(\hat{\eta}) = \sqrt{diag[(A'V^{-1}A)^{-1}]}.

Calculate the test statistic z_1 = \hat{\phi} / se(\hat{\phi}). If z_1 \ge z_{.05}, where z_{.05} is the .05 quantile of the
standard normal distribution (or the critical value for the selected level of significance \gamma_1), then
all following calculations will be performed using the empirical spatial covariance matrix,
i.e., \Sigma_S = S, and the nonparametric EOF model will be used for prediction. Equivalently,
a p-value p_1 can be calculated by evaluating the standard normal cumulative distribution
function (CDF) at z_1 (i.e., p_1 = P(Z \le z_1)). If p_1 \ge level of significance \gamma_1, then all
following calculations will be performed using the empirical covariance matrix.


If the previous test does reject H_0 (i.e., we have not yet decided to continue with the
empirical covariance matrix), continue to perform the following test. Let v =
(s_{11}, s_{22}, ..., s_{nn})' be the (n × 1) vector of location-specific variances. Calculate the
weighted mean variance \bar{v}:

    \bar{v} = 1'W^{-1}v / (1'W^{-1}1) = 1'W^{-1}v / \sum_{i,j} w^{ij}    (17)

where W = [w_{ij}]_{i,j=1,...,n} = [2 s_{ij}^2 / t_{ij}] is an n × n matrix, with t_{ij} defined as in (12),
and W^{-1} = [w^{ij}]_{i,j=1,...,n}.

Calculate the test statistic z_2 = (v - \bar{v}1)'W^{-1}(v - \bar{v}1). If z_2 \ge \chi^2_{n-1,.95} (or the critical value for
[1 - selected level of significance \gamma_2]), all following calculations will be performed
using the empirical spatial covariance matrix, i.e., \Sigma_S = S, and the nonparametric EOF
model will be used for prediction. Equivalently, one may compute a p-value p_2 by
evaluating 1 minus the \chi^2_{n-1} CDF: p_2 = P(\chi^2_{n-1} > z_2). If p_2 < level of significance \gamma_2,
then all following calculations will be performed using the empirical spatial covariance
matrix.


If the two tests in b) and c) do not indicate a switch to the EOF model, all following
calculations will be performed using the parametric covariance model, i.e., the spatial
covariance matrix \Sigma_S is constructed according to (14). Recall that \eta = (ln \sigma^2, -1/\theta^p)'.

The missing parameter \tau^2 is derived as

    \hat{\tau}^2 = max{0, (1/n) \sum_{i=1}^{n} s_{ii} - exp(ln \hat{\sigma}^2)}.
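
The decision rule of the first test (significance of the decay over space) can be sketched with the standard library alone. The function name, argument names, and example values are hypothetical, and the comparison direction follows the p-value rule as read from the text: a p-value at or above the significance level triggers the switch to the EOF model.

```python
from statistics import NormalDist

def spatial_decay_test(phi_hat, se_phi, gamma1=0.05):
    """Return (z1, p1, use_eof) for the one-sided test of the slope phi.

    phi_hat: GLS estimate of phi = -1/theta^p (assumed already computed).
    se_phi: its standard error.
    gamma1: level of significance.
    """
    z1 = phi_hat / se_phi
    p1 = NormalDist().cdf(z1)  # P(Z <= z1) under the standard normal
    use_eof = p1 >= gamma1     # no significant decay: fall back to EOF
    return z1, p1, use_eof

# Strongly negative slope: decay is significant, keep the parametric model.
z_a, p_a, eof_a = spatial_decay_test(-0.5, 0.1)
# Slope close to zero: decay is not significant, switch to EOF.
z_b, p_b, eof_b = spatial_decay_test(-0.01, 0.1)
```

The second test, on homogeneity of variances, needs a chi-square CDF and is not covered by this sketch.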
3.4 Re-fit autoregressive model


We refit the autoregressive model, accounting for spatial dependence, using GLS with
augmented data:

Step 1: Compute the Cholesky factorization \Sigma_S = H_S H_S' and the inverse matrix H_S^{-1}.

Step 2: Substitute 0 for missing values such that \hat{Z}_{t-lag,impute} is an n × L matrix and
\hat{Z}_{t,impute} is a vector of length n.

Step 3: Augment the predictor matrix as follows. Let \hat{Z}_{lag,impute} =
(\hat{Z}_{L+1-lag,impute}', ..., \hat{Z}_{m-lag,impute}')' be an n(m - L) × L matrix and \hat{Z}_{impute} =
(\hat{Z}_{L+1,impute}', ..., \hat{Z}_{m,impute}')' a vector of length n(m - L); then

    \hat{Z}_{lag,aug} = (\hat{Z}_{lag,impute}, I_{Z,miss})

where I_{Z,miss} is an n(m - L) × q_Z indicator matrix, with q_Z the total number of rows
with missing values in either \hat{Z}^* or \hat{Z}_{lag}. If there is a missing value in the ith row of
either \hat{Z}^* or \hat{Z}_{lag}, and if this is the jth out of all q_Z rows that have missing values,
then the jth column of I_{Z,miss} is all 0 except for the ith element, which is set to 1.

Step 4: Remove the spatial correlation: \tilde{Z}_{t-lag,aug} = H_S^{-1} \hat{Z}_{t-lag,aug} and \tilde{Z}_{t,impute} =
H_S^{-1} \hat{Z}_{t,impute}, where \hat{Z}_{t-lag,aug} are the submatrices of \hat{Z}_{lag,aug} that correspond to the
rows of the matrices \hat{Z}_{t-lag,impute}.

Step 5: Use the same computational steps as for the linear system in equation (10) to solve the
linear system

    (\sum_{t=L+1}^{m} \tilde{Z}_{t-lag,aug}' \tilde{Z}_{t-lag,aug}) \hat{\alpha}_{aug} = \sum_{t=L+1}^{m} \tilde{Z}_{t-lag,aug}' \tilde{Z}_{t,impute}    (18)

where \hat{\alpha}_{aug} is a vector of length L + q_Z, and there are L^* + q_Z^* non-redundant
parameters in the above linear system. The AR coefficient estimate \hat{\alpha} is the subvector
consisting of the first L elements of \hat{\alpha}_{aug}; there are L^* non-redundant parameters in the
first L elements of \hat{\alpha}_{aug} and q_Z^* non-redundant parameters in the last q_Z elements of
\hat{\alpha}_{aug}.


3.5 Re-fit Regression model 


Refit the regression model by GLS using augmented data to account for spatio-temporal
correlation in the data.

Step 1: Substitute the following for missing values such that X_{impute} is an nm × D matrix
and Y_{impute} is a vector of length nm: at location s_i, use the mean of Y(s_i) and the
mean of each predictor in X(s_i).

Step 2: Augment the predictor matrix as follows:

    X_{aug} = (X_{impute}, I_{X,miss})

where I_{X,miss} is an nm × q indicator matrix, with q the total number of rows with
missing values in either X or Y. If there is a missing value in the ith row of either X or
Y, and if this is the jth out of all q rows that have missing values, then the jth column
of I_{X,miss} is all 0 except for the ith element, which is 1.

Step 3: Remove the spatial correlation: \tilde{X}_{t,aug} = H_S^{-1} X_{t,aug} and \tilde{Y}_{t,impute} = H_S^{-1} Y_{t,impute}.

Step 4: Remove the autocorrelation:

    \breve{X}_{t,aug} = \tilde{X}_{t,aug} - \hat{\alpha}_1 \tilde{X}_{t-1,aug} - ... - \hat{\alpha}_L \tilde{X}_{t-L,aug}    (19)
    \breve{Y}_{t,impute} = \tilde{Y}_{t,impute} - \hat{\alpha}_1 \tilde{Y}_{t-1,impute} - ... - \hat{\alpha}_L \tilde{Y}_{t-L,impute}, t = L + 1, ..., m    (20)

Step 5: Solve the linear system

    (\breve{X}_{aug}' \breve{X}_{aug}) \hat{\beta}_{aug} = \breve{X}_{aug}' \breve{Y}_{impute}    (21)

where \breve{Y}_{impute} = (\breve{Y}_{L+1,impute}', ..., \breve{Y}_{m,impute}')' is an n(m - L) × 1 vector, \breve{X}_{aug} =
(\breve{X}_{L+1,aug}', ..., \breve{X}_{m,aug}')' is an n(m - L) × (D + q) matrix, \hat{\beta}_{aug} is a vector of length
D + q, and there are D^* + q^* non-redundant parameters in the above linear system. The
regression coefficient estimate \hat{\beta} is the subvector consisting of the first D elements of
\hat{\beta}_{aug}; there are D^* non-redundant parameters in the first D elements of \hat{\beta}_{aug} and q^*
non-redundant parameters in the last q elements of \hat{\beta}_{aug}.
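
The autocorrelation-removal step (Step 4) is a simple AR filter applied time point by time point. The sketch below assumes the data for each time point are already spatially decorrelated and flattened into a list of numbers; the function name and layout are illustrative only.

```python
def remove_autocorrelation(rows_by_time, alpha):
    """Apply x_t - alpha_1 x_{t-1} - ... - alpha_L x_{t-L} for t = L+1, ..., m.

    rows_by_time[t] holds the values for one time point (0-based index).
    alpha holds the L estimated AR coefficients.
    """
    L = len(alpha)
    out = []
    for t in range(L, len(rows_by_time)):
        current = rows_by_time[t]
        filtered = [
            current[k] - sum(alpha[l] * rows_by_time[t - 1 - l][k] for l in range(L))
            for k in range(len(current))
        ]
        out.append(filtered)
    return out

# Hypothetical single-column data that halves at every step, AR(1) with 0.5.
filtered = remove_autocorrelation([[8.0], [4.0], [2.0]], [0.5])
```

The output has m - L rows, matching the n(m - L) transformed records used in the linear system.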


3.6 Statistics to display 


3.6.1 Goodness of Fit statistics 


We present statistics referring to the three main elements of the model: the mean structure, 
the spatial covariance structure, and the temporal structure. 


1. Goodness of fit of the mean structure model X\beta:

Let Q be the set of observations (Y_t(s), X_t(s)) that have missing values in either
Y_t(s) or X_t(s). Note that q has been defined as the number of observations in Q.

Calculate the mean squared error (MSE) and an R^2 statistic based only on complete
observations:

    MSE = \sum (Y_t(s) - \hat{Y}_t(s))^2 / (nm - q - D^*)    (22)

    R^2 = 1 - \sum (Y_t(s) - \hat{Y}_t(s))^2 / \sum Y_t(s)^2,               if there is no intercept
    R^2 = 1 - \sum (Y_t(s) - \hat{Y}_t(s))^2 / \sum (Y_t(s) - \bar{Y})^2,   if there is an intercept    (23)

where all sums run over s \in {s_1, ..., s_n} and t = 1, ..., m with (Y_t(s), X_t(s)) \notin Q,
\hat{Y}_t(s) = X_t'(s)\hat{\beta}, D^* is the number of non-redundant parameters of the re-fitted
regression in the first D elements of \hat{\beta}_{aug}, and \bar{Y} is the mean of Y over complete
observations only. Note that for this calculation the original (untransformed) observations Y
and covariates X are used. Alternatively, we can calculate the adjusted R^2.

2. Goodness of fit for AR model:


Present t-tests for AR parameters based on variance estimates in item 3 in Section 
3.6.2; 


3. Goodness of fit of spatial covariance model: 


Present the test statistics listed in item 5 in Section 3.6.2. 
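
The mean-structure statistics in item 1 can be sketched as follows. The function name is hypothetical, incomplete observations are marked with `None`, and `n_params` stands in for the number of non-redundant regression parameters D*.

```python
def fit_statistics(y, y_hat, n_params, intercept=True):
    """Return (MSE, R^2) computed on complete observations only."""
    pairs = [(a, b) for a, b in zip(y, y_hat) if a is not None and b is not None]
    sse = sum((a - b) ** 2 for a, b in pairs)
    mse = sse / (len(pairs) - n_params)  # len(pairs) plays the role of nm - q
    if intercept:
        mean = sum(a for a, _ in pairs) / len(pairs)
        total = sum((a - mean) ** 2 for a, _ in pairs)
    else:
        total = sum(a ** 2 for a, _ in pairs)
    return mse, 1.0 - sse / total

# Hypothetical observed and fitted values; the fourth observation is incomplete.
mse, r2 = fit_statistics([1.0, 2.0, 3.0, None, 5.0],
                         [1.5, 2.0, 2.5, 4.0, 5.0], n_params=2)
```

With these values the residual sum of squares is 0.5 over four complete observations, giving an MSE of 0.25.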


3.6.2 Model and parameter estimates 


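The following information should be displayed as a summary of the model:

1. Model coefficients \hat{\beta}, \hat{\alpha} obtained in Sections 3.4 and 3.5.

2. Standard errors of the elements of \hat{\beta} based on V(\hat{\beta}), the covariance matrix of \hat{\beta}, which is
the upper D × D submatrix of V(\hat{\beta}_{aug}):

    V(\hat{\beta}_{aug}) = (SS_e / df_e) (\sum_{t=L+1}^{m} \breve{X}_{t,aug}' \breve{X}_{t,aug})^{-1}    (25)

where \breve{X}_{t,aug} and \breve{Y}_{t,impute} denote the transformed quantities from equations (19) and (20), and

• SS_e = \sum_{t=L+1}^{m} ||\breve{Y}_{t,impute} - \hat{\breve{Y}}_{t,impute}||^2 = r_{yy} (n(m - L) - 1) V(\breve{Y}_{impute}), where
  - \hat{\breve{Y}}_{t,impute} is the predicted value based on the estimated \hat{\beta}_{aug},
  - r_{yy} is the element corresponding to \breve{Y}_{impute} in the correlation matrix of the
    re-fitted regression after the sweep operation,
  - n(m - L) is the number of transformed records used in equation (21) for the re-fitted
    regression,
  - and V(\breve{Y}_{impute}) is the variance of \breve{Y}_{impute}.
• df_e = n(m - L) - p, and p = D^* + q^* is the number of non-redundant
  parameters in the re-fitted regression.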


Based on these standard errors, t-test statistics and/or p-values may be computed and
displayed according to the standard definitions and output scheme of linear models (please
refer to the linear model documentation):

(a) For each element \hat{\beta}_j of \hat{\beta} and the corresponding j-th diagonal element of V(\hat{\beta}),
j = 1, ..., D, compute the t-statistic t_j = \hat{\beta}_j / \sqrt{V(\hat{\beta})_{jj}}.

(b) The p-value corresponding to t_j is two times the upper tail probability of a
t-distribution with nm - q - D^* degrees of freedom evaluated at |t_j|, i.e.,
p_j = 2 (1 - P(t_{nm-q-D^*} \le |t_j|)).

Note that depending on the implementation of the GLS estimation in Section 3.5, the inverse
cross-product matrix in equation (25) may have already been computed, in which case this
expression does not need to be recalculated.


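3. Standard errors of \hat{\alpha} based on V(\hat{\alpha}), the covariance matrix of \hat{\alpha}, which is the upper
L × L submatrix of V(\hat{\alpha}_{aug}):

    V(\hat{\alpha}_{aug}) = (SS_e / df_e) (\sum_{t=L+1}^{m} \tilde{Z}_{t-lag,aug}' \tilde{Z}_{t-lag,aug})^{-1}    (26)

where \tilde{Z}_{t-lag,aug} and \tilde{Z}_{t,impute} denote the spatially decorrelated quantities from Section 3.4, and

• SS_e = \sum_{t=L+1}^{m} ||\tilde{Z}_{t,impute} - \hat{\tilde{Z}}_{t,impute}||^2 = r_{zz} (n(m - L) - 1) V(\tilde{Z}_{t,impute}), where
  - \hat{\tilde{Z}}_{t,impute} is the predicted value based on the estimated \hat{\alpha}_{aug} and \tilde{Z}_{t-lag,aug},
  - r_{zz} is the element corresponding to \tilde{Z}_{t,impute} in the correlation matrix of the
    re-fitted autoregressive model after the sweep operation,
  - n(m - L) is the number of transformed records used in equation (18) for the re-fitted
    autoregressive model,
  - and V(\tilde{Z}_{t,impute}) is the variance of \tilde{Z}_{t,impute}.

• df_e = n(m - L) - p_{AR}, and p_{AR} = L^* + q_Z^* is the number of non-redundant
  parameters in the re-fitted autoregressive model.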


Based on these standard errors, t-test statistics and/or p-values may be computed and
displayed according to the standard definitions and output scheme of linear models.

(a) For each element \hat{\alpha}_j of \hat{\alpha} and the corresponding j-th diagonal element of V(\hat{\alpha}),
j = 1, ..., L, compute the t-statistic t_j = \hat{\alpha}_j / \sqrt{V(\hat{\alpha})_{jj}}.

(b) The p-value corresponding to t_j is two times the upper tail probability of a
t-distribution with \sum_t n_t - L^* degrees of freedom evaluated at |t_j|, i.e.,
p_j = 2 (1 - P(t_{\sum_t n_t - L^*} \le |t_j|)).


4. Indicator of which method has been automatically chosen to model spatial 
covariances, either empirical covariance (EOF) or parametric variogram model. 


5. Test statistics from goodness of fit tests for parametric model: 


- Test statistic z_1, p-value p_1, level of significance \gamma_1 used for the automated test for fit
of the slope parameter

- Test statistic z_2, p-value p_2, level of significance \gamma_2 used for testing homogeneity
of variances


6. Parametric covariance parameters if parametric model has been chosen 


3.6.3 Tests of effects in Mean Structure Model (Type III)


For each effect specified in the model, a type III test matrix L_i is constructed and H_0: L_i \beta = 0
is tested. Construction of the type III matrix L_i, as well as of the generating estimable function
(GEF), is based on the generating matrix H, which is the upper D × D submatrix of
(\breve{X}_{aug}' \breve{X}_{aug})^{-} \breve{X}_{aug}' \breve{X}_{aug}, where \breve{X}_{aug} is the transformed augmented predictor matrix of
equation (21), such that L_i \beta is estimable. It involves parameters only for the
given effect. For type III analysis, L_i does not depend on the order of effects specified in the
model. If such a matrix cannot be constructed, the effect is not testable.

The L_i matrix is then used to construct the test statistic

    F = \hat{\beta}' L_i' (L_i \Sigma L_i')^{-1} L_i \hat{\beta} / r_c

where

• \hat{\beta} is the subvector of the first D elements of \hat{\beta}_{aug} obtained in Step 5 of Section 3.5,
• r_c = rank(L_i \Sigma L_i'),
• \Sigma is the covariance matrix of \hat{\beta}, which is the upper D × D submatrix of V(\hat{\beta}_{aug})
  defined in equation (25).

The statistic has an approximate F distribution. The numerator degrees of freedom df1 is r_c
and the denominator degrees of freedom df2 is nm - q - D^*, where D^* is the number of
non-redundant parameters in the first D parameters of the refitted regression model obtained in
Section 3.5. The p-values can then be calculated accordingly.


An additional test should also be computed, which is similar to the “corrected model” test (if
there is an intercept) or the “model” test (if there is no intercept) in the ANOVA table of linear
regression. Essentially, the null hypothesis is that the regression parameters (except the
intercept, if there is one) are zero. The test statistic is the same as the F statistic above,
except that the L matrix is taken from the GEF. If there is no intercept, the L matrix is the
whole GEF. If there is an intercept, the L matrix is the GEF without its first row, which
corresponds to the intercept.


Statistics saved for Test of effects in Mean Structure Model (including corrected model or 
model): 


e F statistics 


e dfl 
e df2 
e p-value 


3.6.4 Location clustering for spatial structure visualization 


A large spatial covariance or correlation matrix is not well suited to demonstrating the
relations among the locations. A grouping method, also called community detection or position
analysis (Wasserman, 1994), can be used to identify representative location clusters. To
simplify the implementation, hierarchical clustering (Johnson, 1967) is used to detect clusters
among locations based on the STP model spatial statistics.


Please note location clustering is only supported when empirical nonparametric covariance 
model is used. 


Given a set of n locations {s_1, ..., s_n} in STP to be clustered, use their spatial
correlation matrix R, an n × n matrix, as the similarity matrix:

    R = [r_{ij}]_{i,j=1,...,n}

Given a similarity threshold \alpha with default value 0.2, and N_C with default value 10, the
process of location clustering is described in the following steps, which are based on the basic
process of hierarchical clustering.


Step 1. Initialize the clusters and similarities:

• Assign each location s_i to a cluster C_i (i = 1, ..., n), so that for n locations the total
  number of clusters is n_C = n at the beginning, and each cluster has just one location.
• Define the set of clusters C = {C_1, ..., C_{n_C}}.
• Define the similarity matrix R^C = [r^C_{ij}]_{i,j=1,...,n_C}, where the similarity r^C_{ij} between the
  clusters C_i and C_j is the similarity r_{ij} between locations s_i and s_j.

Step 2. Find the two clusters C_i and C_j in C with the largest similarity max(r^C_{ij}).

If max(r^C_{ij}) > \alpha:

• Merge C_i and C_j into a new cluster C_{(ij)} that includes all locations in C_i and C_j.
• Compute the similarities between the new cluster C_{(ij)} and the other clusters C_k, k \ne i and j:

    r^C_{(ij),k} = min(r^C_{ik}, r^C_{jk})

• Update C by adding C_{(ij)} and discarding C_i and C_j, so that n_C = n_C - 1.
• Update the similarity matrix R^C by adding r^C_{(ij),k} and discarding r^C_{ik} and r^C_{jk}; go to Step 3.

If max(r^C_{ij}) \le \alpha, go to Step 4.

Step 3. Repeat Step 2.

Step 4. For all detected clusters with more than one location, compute the following statistics:

• Cluster size: n_{C_i} is the number of locations in C_i.
• Closeness:

    d_i = (1 / (n_{C_i}(n_{C_i} - 1)/2)) \sum_{k<l} r_{kl},  s_k, s_l \in C_i.

Step 5. Define clusters for interactive visualization:

• C_{closeness}: the first N_C clusters sorted by descending closeness d_i.
• C_{size}: the first N_C clusters sorted by descending cluster size n_{C_i}.

Step 6. Output the union for location cluster visualization:

    C^* = C_{closeness} \cup C_{size}
Statistics saved for spatial structure visualization including: 
4. Number of excluded locations during handling of missing data 


5. Spatial correlation matrix R = [r_{ij}]_{i,j=1,...,n}

6. Statistics of each output location cluster in C^*:
• Closeness d_i
• Cluster size n_{C_i}
• Coordinates of the locations in this cluster
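
The clustering steps of this section can be sketched as below. The function names are hypothetical, locations are referred to by their index in the similarity matrix, and the between-cluster similarity is computed directly as the minimum pairwise similarity, which is what repeated application of the min update yields.

```python
def cluster_locations(r, alpha=0.2):
    """Merge clusters while the largest between-cluster similarity exceeds alpha."""
    clusters = [[i] for i in range(len(r))]

    def similarity(a, b):
        # Smallest pairwise similarity between two clusters.
        return min(r[i][j] for i in a for j in b)

    while len(clusters) > 1:
        best, a, b = max(
            (similarity(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if best <= alpha:
            break  # no remaining pair is similar enough to merge
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters

def closeness(cluster, r):
    """Average pairwise similarity within a cluster of two or more locations."""
    pairs = [(k, l) for idx, k in enumerate(cluster) for l in cluster[idx + 1:]]
    return sum(r[k][l] for k, l in pairs) / len(pairs)

# Hypothetical correlation matrix for three locations.
r = [[1.0, 0.9, 0.1],
     [0.9, 1.0, 0.15],
     [0.1, 0.15, 1.0]]
clusters = cluster_locations(r, alpha=0.2)
```

Locations 0 and 1 are merged because their similarity of 0.9 exceeds the threshold, while location 2 stays on its own.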
3.7 Results saved for prediction

1. Model coefficients \hat{\beta}, \hat{\alpha}, and the covariance estimate V(\hat{\beta}) as defined in (25).

2. Transformed regression residuals and predictors of the L most recent observations for
prediction:

    \tilde{Z}_{m-l+1} = H_S'^{-1} H_S^{-1} (Y_{m-l+1,impute} - X_{m-l+1,aug} \hat{\beta}_{aug}), l = 1, ..., L    (27)

    \tilde{X}_{m-l+1,impute} = H_S'^{-1} H_S^{-1} X_{m-l+1,impute}, l = 1, ..., L    (28)

3. Indicator of which method has been chosen to model spatial covariances, either
empirical covariance (EOF) or parametric variogram model.

4. Parametric covariance parameters \psi if the parametric model has been chosen.

5. Coordinates of locations s.

6. Number of unique time points used for model build, m.

7. Number of records with missing values in the data set used in model building, q.

8. Spatial covariance matrix \Sigma_S.

9. H_S^{-1}, the inverse of the Cholesky factor of the spatial covariance matrix.


4 Prediction 


We perform the following procedure to issue predictions for future times m + 1, ..., m + H at
prediction locations G = (g_1, ..., g_N) using the results saved in the output file (see Figure 2).
The input data set should include the locations G and the predictors X for t = m + 1, ..., m + H.

[Flowchart: (1) read saved results and input data (prediction locations G, predictors X for
times m + 1, ..., m + H); (2) use variogram model? (3) if no, estimate the empirical spatial
covariance matrix; if yes, estimate the parametric spatial covariance model.]


Figure 2. Flowchart of algorithm steps for model prediction 
4.1 Point prediction 
Step 1: Construct the N × n spatial covariance matrix to capture the spatial dependence
between prediction grids g ∈ G and original sample locations s.
• If the variogram-based spatial covariance function is used:

$$V_s(g) = V(\varepsilon_t(g)) = \sigma^2 + \tau^2 \quad (29)$$
and
$$C_s(G) = \{Cov(\varepsilon_t(g_i), \varepsilon_t(s_j))\}_{i=1,\dots,N;\ j=1,\dots,n} \quad (30)$$
according to (14) for all locations g (whether locations were included in the model build or not).
• If the EOF-based spatial covariance function is used:

For locations $g_i$ that are included in the original sample locations s,
$Cov_{EOF}(\varepsilon_t(g_i), \varepsilon_t(s))$ is equal to the row corresponding to location $g_i$ in the
empirical covariance matrix $\Sigma_s$, and $V_s(g_i)$ is equal to the empirical variance at
that location, i.e., the diagonal element of $\Sigma_s$ corresponding to that location.
For locations $g_i$ that were not included in the model build, calculate the spatial
covariance in the following way:

(a) Perform eigendecomposition on the empirical covariance matrix
$$\Sigma_s = \Phi \Lambda \Phi'$$
where $\Phi = (\phi_1, \dots, \phi_n)$ with $\phi_k = (\phi_k(s_1), \dots, \phi_k(s_n))'$ is the n × n matrix
of eigenvectors and $\Lambda = diag(\lambda_1, \dots, \lambda_n)$ is the n × n matrix of eigenvalues.

(b) Apply inverse distance weighting (IDW) (Shepard 1968) to interpolate
eigenvectors to locations with no observations:

$$\phi_k(g) = \frac{\sum_{i=1}^{n} w_i(g)\phi_k(s_i)}{\sum_{j=1}^{n} w_j(g)}, \quad k = 1, \dots, n$$

where
$$w_i(g) = \frac{1}{dist(g, s_i)^p}$$
is an Inverse Distance Weighting (IDW) function with p ≤ d for d-dimensional space, and
$dist(g, s_i)$ may be any distance function. As a default, use Euclidean distance with p = 2, so that
$dist(g, s_i)^2 = (g - s_i)'(g - s_i)$.

• The EOF-based spatial variance-covariance functions are

$$V_s(g) = V(\varepsilon_t(g)) = \sum_{k=1}^{n} \lambda_k \phi_k^2(g) \quad (31)$$
and
$$Cov_{EOF}(\varepsilon_t(g_i), \varepsilon_t(s_j)) = \sum_{k=1}^{n} \lambda_k \phi_k(g_i)\phi_k(s_j) \quad (32)$$

and the corresponding N × n spatial covariance matrix is

$$C_s(G) = \{Cov_{EOF}(\varepsilon_t(g_i), \varepsilon_t(s_j))\}_{i=1,\dots,N;\ j=1,\dots,n} \quad (33)$$
Note that under the EOF model, we allow for space-varying variances.
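As an illustration of step (b) and of (31), the following is a minimal sketch in Python, assuming Euclidean distance, the default p = 2, and eigenpairs that have already been computed elsewhere. The function names and the rule of putting all weight on a coinciding sample site are assumptions made for this example, not part of the documented procedure.

```python
import math

def idw_weights(g, sites, p=2.0):
    """Weights w_i(g) = 1 / dist(g, s_i)**p using Euclidean distance."""
    weights = []
    for s in sites:
        d = math.dist(g, s)
        if d == 0.0:
            # Assumption: if g coincides with a sample site, use that site only.
            return [1.0 if t == s else 0.0 for t in sites]
        weights.append(1.0 / d ** p)
    return weights

def interpolate_eigenvector(g, sites, phi_k, p=2.0):
    """phi_k(g) = sum_i w_i(g) phi_k(s_i) / sum_j w_j(g)."""
    w = idw_weights(g, sites, p)
    return sum(wi * v for wi, v in zip(w, phi_k)) / sum(w)

def eof_variance(g, sites, eigvals, eigvecs, p=2.0):
    """V_s(g) = sum_k lambda_k * phi_k(g)**2, as in (31)."""
    return sum(lam * interpolate_eigenvector(g, sites, phi, p) ** 2
               for lam, phi in zip(eigvals, eigvecs))

sites = [(0.0, 0.0), (2.0, 0.0)]
phi_1 = [1.0, 3.0]  # values of one eigenvector at the two sample sites
value = interpolate_eigenvector((0.5, 0.0), sites, phi_1)
```

Here the nearer site receives nine times the weight of the farther one, so the interpolated value stays close to the value at the nearer site.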


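Step 2: Spatial interpolation to prediction locations g for the most recent L time units,
$\hat{Z}_{m-L+1}, \dots, \hat{Z}_m$:
$$\hat{Z}_{m-l+1}(G) = C_s(G)\Sigma_s^{-1}\hat{Z}_{m-l+1} = C_s(G)\tilde{Z}_{m-l+1}, \quad l = 1, \dots, L \quad (34)$$

where $\hat{Z}_{m-l+1}(G)$ is a vector of length N.
Step 3: Iteratively forecast for future times m + 1, ..., m + H at prediction locations G.
$$\hat{Z}_{m+1}(G) = \hat{\alpha}_1\hat{Z}_m(G) + \dots + \hat{\alpha}_L\hat{Z}_{m-L+1}(G) \quad (35)$$
$$\hat{Z}_{m+2}(G) = \hat{\alpha}_1\hat{Z}_{m+1}(G) + \dots + \hat{\alpha}_L\hat{Z}_{m-L+2}(G) \quad (36)$$
$$\vdots$$
$$\hat{Z}_{m+H}(G) = \hat{\alpha}_1\hat{Z}_{m+H-1}(G) + \dots + \hat{\alpha}_L\hat{Z}_{m+H-L}(G) \quad (37)$$
where $\hat{Z}_{m+h}(G)$, h = 1, ..., H, are vectors of length N.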
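The recursion in Step 3 can be sketched as follows. This is only an illustration under stated assumptions: the interpolated residual vectors for the L most recent time units are supplied oldest first, the AR coefficients are given as a plain list, and the function name is hypothetical.

```python
def forecast_residuals(z_recent, alpha, horizon):
    """Iterate Z_{m+h}(G) = alpha_1 Z_{m+h-1}(G) + ... + alpha_L Z_{m+h-L}(G).

    z_recent: L vectors [Z_{m-L+1}(G), ..., Z_m(G)], oldest first.
    alpha:    [alpha_1, ..., alpha_L].
    """
    L = len(alpha)
    history = [list(z) for z in z_recent]
    forecasts = []
    for _ in range(horizon):
        # alpha[l] (that is, alpha_{l+1}) multiplies the vector l+1 steps back.
        nxt = [sum(alpha[l] * history[-1 - l][i] for l in range(L))
               for i in range(len(history[-1]))]
        history.append(nxt)
        forecasts.append(nxt)
    return forecasts

# AR(2) example at a single prediction location: Z_{m-1} = 4, Z_m = 8.
ar2_forecast = forecast_residuals([[4.0], [8.0]], [0.5, 0.25], horizon=2)
```

Each forecast is appended to the history so that later horizons reuse earlier forecasts, which is what makes the scheme iterative.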


Step 4: Incorporate predicted systematic effect

$$\hat{Y}_{m+h}(G) = \hat{Z}_{m+h}(G) + X_{m+h}(G)\hat{\beta}, \quad h = 1, \dots, H \quad (38)$$

where $\hat{Y}_{m+h}(G)$, h = 1, ..., H, are vectors of length N.


4.2 Prediction intervals 


Under the assumption of a Gaussian process and known variance components, the prediction
error $\hat{Y}_{m+h}(g_i) - Y_{m+h}(g_i)$ comes from two sources:

• The prediction error that would be incurred even if the regression coefficients β were known.

• The error in estimating the regression coefficients β.

The variance of the prediction error is thus

$$V[\hat{Y}_{m+h}(g_i) - Y_{m+h}(g_i)] = (X'_{m+h}(g_i) - C'_{m+h}(g_i)\Sigma^{-1}X_{impute})\,V(\hat{\beta})\,(X'_{m+h}(g_i) - C'_{m+h}(g_i)\Sigma^{-1}X_{impute})' \quad (39)$$
$$\qquad + V_{m+h}(g_i) - C'_{m+h}(g_i)\Sigma^{-1}C_{m+h}(g_i) \quad (40)$$
Expression (39) arises from the variance expression for universal kriging, while (40) is
the variance of a predicted random effect with known variance of the random effects
(McCulloch et al. 2008, p. 171).

• $C_{m+h}(g_i) = C_T(m+h) \otimes C_s(g_i)$ is the covariance vector of length nm between the
prediction $Y_{m+h}(g_i)$ and the measurements $Y_1(s), \dots, Y_m(s)$. Note that
$C_T(m+h) = \{\gamma_T(m+h-t)\}_{t=1,\dots,m}$ is the AR(L) covariance vector of length m and
$C_s(g_i) = \{Cov(Y_t(g_i), Y_t(s_j))\}_{j=1,\dots,n}$ is the spatial covariance vector of length n.

• The nm × nm covariance matrix Σ is defined as $\Sigma = \Sigma_T \otimes \Sigma_s$, with
$\Sigma_T = \{\gamma_T(t - t')\}_{t,t'=1,\dots,m}$. Note that $\Sigma_s$ is a quantity stored after the model build step.

• $V_{m+h}(g_i) = V(Y_{m+h}(g_i)) = \gamma_T(0)V_s(g_i)$ is the variance of $Y_{m+h}(g_i)$.

• Note that expressions (39) and (40) are not computed explicitly, but instead are
implemented as described in the following.


Computational process:

Step 1: Compute the error in estimating the regression coefficients β in (39).

For l = 1, ..., L, interpolate X to prediction locations g for the most recent L time units:

$$P_{m+1-l}(g_i) = X'_{m+1-l,impute}\Sigma_s^{-1}C_s(g_i) = \tilde{X}'_{m+1-l,impute}C_s(g_i) \quad (41)$$

where $P_{m+1-l}(g_i)$ is a vector of dimension D × 1. Define

$$\tilde{X}_{m+h-l}(g_i) = \begin{cases} P_{m+h-l}(g_i), & \text{if } h - l \le 0; \\ X_{m+h-l}(g_i), & \text{otherwise.} \end{cases} \quad (42)$$

For t = m − L + 1, ..., m (h ≤ l), we only have X at sample locations s, so
$\tilde{X}_t(g_i) = P_t(g_i)$, the values interpolated from $X_t(s)$; for t > m (or h > l), X at
prediction locations g has already been supplied as input, so there is no need to interpolate and
$\tilde{X}_t(g_i) = X_t(g_i)$.

Then, for h = 1, ..., H, recursively compute the D × 1 vectors $W_{m+h}(g_i)$:

$$W_{m+h}(g_i) = X_{m+h}(g_i) + \sum_{l=1}^{L}\hat{\alpha}_l\left(\tilde{W}_{m+h-l}(g_i) - \tilde{X}_{m+h-l}(g_i)\right) \quad (43)$$
where
$$\tilde{W}_{m+h-l}(g_i) = \begin{cases} 0, & \text{if } h - l \le 0; \\ W_{m+h-l}(g_i), & \text{otherwise.} \end{cases} \quad (44)$$

The prediction error in estimating β, that is, expression (39), is thus

$$W'_{m+h}(g_i)V(\hat{\beta})W_{m+h}(g_i) \quad (45)$$

where $V(\hat{\beta})$ is computed in (25).

Step 2: Compute the prediction error that would be incurred if the regression coefficients were
known, i.e., expression (40).


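• Compute $C_T(m+h)$ by the AR(L) autocovariance function $\gamma_T(k)$ (McLeod 1975).
First, compute $\gamma_T(0), \dots, \gamma_T(L)$ by solving the linear system $A\gamma = b$:

$$\begin{pmatrix}
1 & -\alpha_1 & -\alpha_2 & \cdots & -\alpha_{L-1} & -\alpha_L \\
-\alpha_1 & 1-\alpha_2 & -\alpha_3 & \cdots & -\alpha_L & 0 \\
-\alpha_2 & -(\alpha_1+\alpha_3) & 1-\alpha_4 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
-\alpha_{L-1} & -(\alpha_{L-2}+\alpha_L) & -\alpha_{L-3} & \cdots & 1 & 0 \\
-\alpha_L & -\alpha_{L-1} & -\alpha_{L-2} & \cdots & -\alpha_1 & 1
\end{pmatrix}
\begin{pmatrix} \gamma_T(0) \\ \gamma_T(1) \\ \gamma_T(2) \\ \vdots \\ \gamma_T(L-1) \\ \gamma_T(L) \end{pmatrix}
=
\begin{pmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \\ 0 \end{pmatrix} \quad (46)$$

Note that the first element of the vector on the right-hand side (the variance of the
measurement error) is fixed to be one, to account for the normalization through the
spatial variance-covariance structure.

For k = L + 1, ..., m + H − 1, recursively compute

$$\gamma_T(k) = \hat{\alpha}_1\gamma_T(k-1) + \dots + \hat{\alpha}_L\gamma_T(k-L) \quad (47)$$
Remark: To construct the (L + 1) × (L + 1) matrix A,
$$A_{ij} = \begin{cases} -[\alpha_{i-1}], & j = 1;\ i = 1, \dots, L+1 \\ -[\alpha_{i-j}] - [\alpha_{i+j-2}], & j = 2, \dots, L+1;\ i = 1, \dots, L+1 \end{cases} \quad (48)$$

where
$$[\alpha_k] = \begin{cases} -1, & k = 0; \\ 0, & k < 0 \text{ or } k > L; \\ \alpha_k, & 0 < k \le L. \end{cases} \quad (49)$$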
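The construction of A in (48) and (49) and the recursion (47) can be sketched as below. This is an illustrative sketch, assuming the sign convention in which the first row of the system reads γ(0) − α_1 γ(1) − ... − α_L γ(L) = 1; the helper names and the small Gaussian elimination routine are not part of the documented procedure.

```python
def bracket(alpha, k):
    """[alpha_k] as in (49); alpha is the list [alpha_1, ..., alpha_L]."""
    if k == 0:
        return -1.0
    if k < 0 or k > len(alpha):
        return 0.0
    return alpha[k - 1]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def ar_autocovariance(alpha, max_lag):
    """Return [gamma(0), ..., gamma(max_lag)] with unit innovation variance."""
    L = len(alpha)
    A = [[-bracket(alpha, i - 1) if j == 1
          else -(bracket(alpha, i - j) + bracket(alpha, i + j - 2))
          for j in range(1, L + 2)] for i in range(1, L + 2)]
    gamma = solve(A, [1.0] + [0.0] * L)
    for k in range(L + 1, max_lag + 1):
        # Recursion (47): gamma(k) = sum_l alpha_l * gamma(k - l).
        gamma.append(sum(alpha[l] * gamma[k - 1 - l] for l in range(L)))
    return gamma[:max_lag + 1]

ar1 = ar_autocovariance([0.5], 2)
ar2 = ar_autocovariance([0.5, 0.2], 3)
```

For an AR(1) series with coefficient 0.5, this reproduces the textbook values γ(0) = 1/(1 − 0.25) and γ(k) = 0.5 γ(k − 1).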


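• Compute the approximate factorization of $\Sigma_T^{-1}$ such that $R'R \approx \Sigma_T^{-1}$, where R is
an (m − L) × m matrix (this follows from Cholesky or Gram-Schmidt orthogonalization,
see for example Fuller 1975):

$$R = \begin{pmatrix}
-\alpha_L & \cdots & -\alpha_1 & 1 & 0 & \cdots & 0 \\
0 & -\alpha_L & \cdots & -\alpha_1 & 1 & \cdots & 0 \\
\vdots & & \ddots & & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & -\alpha_L & \cdots & -\alpha_1 & 1
\end{pmatrix} \quad (50)$$

• Compute the value of expression (40):

$$\gamma_T(0)V_s(g_i) - (C'_T(m+h) \otimes C'_s(g_i))(R'R \otimes H_s'^{-1}H_s^{-1})(C_T(m+h) \otimes C_s(g_i)) \quad (51)$$

where $C_s(g_i)$ is the row of $C_s(G)$ corresponding to location $g_i$.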


Step 3: The (1 − α)100% prediction interval is

$$\hat{Y}_{m+h}(g_i) \pm t_{nm-q-D^*,\,\alpha/2}\sqrt{V[\hat{Y}_{m+h}(g_i) - Y_{m+h}(g_i)]} \quad (55)$$

where $V[\hat{Y}_{m+h}(g_i) - Y_{m+h}(g_i)]$ is the sum of expressions (39) and (40) as computed in
expressions (45) and (51), respectively. $t_{nm-q-D^*,\,\alpha/2}$ is defined by
$P(X < t_{nm-q-D^*,\,\alpha/2}) = 1 - \alpha/2$, where X follows a t distribution with nm − q − D* degrees of
freedom. The default value for α is 0.05.


As final output from the prediction step, point prediction, variances of point predictions and 
prediction interval (lower and upper bounds) are issued for each specified (location, time). 


We remark that to perform what-if-analysis, a set of X variables under the new settings need 
to be provided. Then we re-run the prediction algorithm described in Section 4 to obtain 
prediction results under adjusted settings. 
SPCHART Algorithms


Nine types of Shewhart control charts can be created. In this chapter, the charts are grouped
into five sections:

■ X-Bar and R Charts
■ X-Bar and s Charts
■ Individual and Moving Range Charts
■ p and np Charts
■ u and c Charts

For each type of control chart, the process, the center line, and the control limits (upper and 
lower) are described. 


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


Table 95-1
Notation
Notation     Description
$\sigma$     Population standard deviation for measurements $x_{ij}$
A            Number of sigmas specified by the user, 0 < A < 9
K            Number of subgroups
$n_i$        Number of units (samples) for subgroup i
N            Total sample size, equal to $n_1 + \dots + n_K$
$x_{ij}$     Measurement (observation) for the jth unit (sample) of subgroup i
$\bar{x}_i$  Mean of measurements for subgroup i, $\bar{x}_i = \left(\sum_{j=1}^{n_i} x_{ij}\right)/n_i$
$S_i$        Sample standard deviation for subgroup i, $S_i^2 = \left(\sum_{j=1}^{n_i}(x_{ij} - \bar{x}_i)^2\right)/(n_i - 1)$
$R_i$        Sample range for subgroup i, $R_i = \max(x_{i1}, \dots, x_{in_i}) - \min(x_{i1}, \dots, x_{in_i})$
LCL          Lower Control Limit
UCL          Upper Control Limit


Weight 


Weights can be used when the data organization is Cases are units.
■ Each value for weight must be a positive integer.
■ Cases with either non-positive or fractional weights are dropped.
■ When weight is in effect, $n_i$ is a weighted sum for all the units in subgroup i, and $\bar{x}_i$ and $\bar{\bar{x}}$
are weighted means.
X-Bar and R Charts


When X-Bar and R charts are paired, the sample range statistic R is used to construct the control 
limits for the X-Bar chart. 


Note: Subgroups whose sample sizes are less than the specified minimum value are dropped. 


Equal Sample Sizes 
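Assume that $n_i = n$ for i = 1, ..., K. The process for the X-Bar chart is $\{\bar{x}_i : i = 1, \dots, K\}$. The
center line for an X-Bar chart is the grand mean statistic:

$$\bar{\bar{x}} = \frac{1}{K}\sum_{i=1}^{K}\bar{x}_i$$

and the control limits are

$$LCL = \bar{\bar{x}} - A\bar{R}/(d_2(n)\sqrt{n})$$
$$UCL = \bar{\bar{x}} + A\bar{R}/(d_2(n)\sqrt{n})$$

where

$$\bar{R} = \frac{1}{K}\sum_{i=1}^{K}R_i$$

is the mean range statistic. The process for an R chart is $\{R_i : i = 1, \dots, K\}$. The center line
for an R chart is $\bar{R}$ and the control limits are

$$LCL = \max(\bar{R}(1 - A\,d_3(n)/d_2(n)),\ 0)$$
$$UCL = \bar{R}(1 + A\,d_3(n)/d_2(n))$$

The auxiliary functions are

$$d_2(n) = \int_{-\infty}^{\infty}\left(1 - (1 - \Phi(x))^n - (\Phi(x))^n\right)dx$$

$$d_3(n) = \left(2\int_{-\infty}^{\infty}\int_{-\infty}^{x}\left(1 - (\Phi(x))^n - (1 - \Phi(y))^n + (\Phi(x) - \Phi(y))^n\right)dy\,dx - (d_2(n))^2\right)^{1/2}$$

$$\Phi(x) = \int_{-\infty}^{x}\frac{1}{\sqrt{2\pi}}e^{-y^2/2}\,dy$$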

Unequal Sample Sizes

The processes for X-Bar and R charts are the same as described in the section “Equal Sample
Sizes” above. The center line for an X-Bar chart is the grand mean statistic (numerically identical
to that in the section “Equal Sample Sizes”):

$$\bar{\bar{x}} = \frac{\sum_{i=1}^{K} n_i\bar{x}_i}{\sum_{i=1}^{K} n_i}$$

and the control limits for subgroup i are

$$LCL = \bar{\bar{x}} - A\hat{\sigma}/\sqrt{n_i}$$
$$UCL = \bar{\bar{x}} + A\hat{\sigma}/\sqrt{n_i}$$

The center line for an R chart for subgroup i is $\bar{R}_i = \hat{\sigma}d_2(n_i)$ for i = 1, ..., K, where

$$\hat{\sigma} = \frac{1}{K}\sum_{i=1}^{K}R_i/d_2(n_i)$$

and the control limits for subgroup i are

$$LCL = \max(\bar{R}_i - A\hat{\sigma}d_3(n_i),\ 0)$$
$$UCL = \bar{R}_i + A\hat{\sigma}d_3(n_i)$$

X-Bar and s Charts 


When X-Bar and s charts are paired, the sample standard deviation is used to construct the 
control limits for the X-Bar chart. 


Equal Sample Sizes 


Assume $n_i = n$. The process for the X-Bar chart is $\{\bar{x}_i : i = 1, \dots, K\}$. The center line for an
X-Bar chart is $\bar{\bar{x}}$ and the control limits are

$$LCL = \bar{\bar{x}} - A\bar{S}/(c_4(n)\sqrt{n})$$
$$UCL = \bar{\bar{x}} + A\bar{S}/(c_4(n)\sqrt{n})$$

The process for an s chart is $\{S_i : i = 1, \dots, K\}$. The center line for an s chart is

$$\bar{S} = \frac{1}{K}\sum_{i=1}^{K}S_i$$

The auxiliary function is

$$c_4(n) = \sqrt{\frac{2}{n-1}}\,\frac{\Gamma(n/2)}{\Gamma((n-1)/2)}$$

where Γ(.) is the complete Gamma function.

Note: When n > 25, $c_4(n)\sqrt{n}$ can be approximated by $\sqrt{n - 0.5}$, $\sqrt{1 - (c_4(n))^2}/c_4(n)$ can be
approximated by $1/\sqrt{2n - 2.5}$, and $c_4(n)$ can be approximated by $\sqrt{(4n - 5)/(4n - 3)}$.


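Unequal Sample Sizes

The processes for X-Bar and s charts are the same as the processes in the section “Equal Sample
Sizes” above. The center line for an X-Bar chart is $\bar{\bar{x}}$ and the control limits are

$$LCL = \bar{\bar{x}} - A\hat{\sigma}/\sqrt{n_i}$$
$$UCL = \bar{\bar{x}} + A\hat{\sigma}/\sqrt{n_i}$$

or

$$LCL = \bar{\bar{x}} - A\bar{S}_i/(c_4(n_i)\sqrt{n_i})$$
$$UCL = \bar{\bar{x}} + A\bar{S}_i/(c_4(n_i)\sqrt{n_i})$$
where

$$\hat{\sigma} = \frac{1}{K}\sum_{i=1}^{K}S_i/c_4(n_i)$$

However, the center line for an s chart for subgroup i is $\bar{S}_i = \hat{\sigma}c_4(n_i)$ for i = 1, ..., K, and the control
limits are

$$LCL = \max\left(\bar{S}_i - A\hat{\sigma}\sqrt{1 - (c_4(n_i))^2},\ 0\right)$$
$$UCL = \bar{S}_i + A\hat{\sigma}\sqrt{1 - (c_4(n_i))^2}$$

or

$$LCL = \max\left(\bar{S}_i - A\bar{S}_i\sqrt{1 - (c_4(n_i))^2}/c_4(n_i),\ 0\right)$$
$$UCL = \bar{S}_i + A\bar{S}_i\sqrt{1 - (c_4(n_i))^2}/c_4(n_i)$$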

Individual and Moving Range Charts 


When a weight variable is specified, each unit of the process is expanded to multiple units 
based on the case weight associated with this particular unit. The span (specified by the user) 
is associated with the expanded process. If the span is greater than N (the total number of units 
of the expanded process), an error message is displayed and neither an Individual nor a Moving 
Range chart is generated. 

Since each subgroup has only one unit, the process for an Individual chart is
$\{y_i : i = 1, \dots, N\}$, where $y_i$ is the ith unit of the expanded process. For a span of length m,
the moving ranges are

$$R_i = \begin{cases} \max(y_{i-m+1}, \dots, y_i) - \min(y_{i-m+1}, \dots, y_i), & \text{if } i = m, \dots, N \\ \text{SYSMIS}, & \text{if } i = 1, \dots, m-1 \end{cases}$$

The average moving range is

$$\bar{R} = \frac{1}{N-m+1}\sum_{i=m}^{N}R_i$$

The center line for an Individual chart is the process mean $\bar{x}$ and the control limits for an Individual chart are

$$LCL = \bar{x} - A\bar{R}/d_2(m)$$
$$UCL = \bar{x} + A\bar{R}/d_2(m)$$

The process for a moving range chart is $\{R_i : i = m, \dots, N\}$. The center line for a moving range chart
is $\bar{R}$. The control limits for a moving range chart are

$$LCL = \max(\bar{R}(1 - A\,d_3(m)/d_2(m)),\ 0)$$
$$UCL = \bar{R}(1 + A\,d_3(m)/d_2(m))$$
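A small sketch of the moving range computation follows. It assumes the weights have already been expanded into individual units, uses `None` as a stand-in for SYSMIS, and raises an error when the span exceeds the number of units, mirroring the error condition described above. The function names are hypothetical.

```python
def moving_ranges(y, m):
    """Moving ranges R_i for a span of length m; None marks i = 1..m-1."""
    if m > len(y):
        raise ValueError("span is greater than the number of units")
    out = [None] * (m - 1)
    for i in range(m - 1, len(y)):
        window = y[i - m + 1:i + 1]  # the m most recent units, ending at unit i
        out.append(max(window) - min(window))
    return out

def average_moving_range(y, m):
    """Mean of the N - m + 1 defined moving ranges."""
    valid = [r for r in moving_ranges(y, m) if r is not None]
    return sum(valid) / len(valid)

ranges = moving_ranges([1, 3, 2, 5], 2)
r_bar = average_moving_range([1, 3, 2, 5], 2)
```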


p and np Charts 


The data for p and np charts are attribute data. Each measurement $x_{ij}$ is either 0 or 1, where 1
indicates a non-conforming measurement. Therefore,

$$x_i = \sum_{j=1}^{n_i}x_{ij}$$

is the count of non-conforming units for subgroup i. When a weight variable is specified, $x_i$ is
a weighted sum of non-conforming units. If the data are aggregated and the value of the count
variable is greater than the total number of units for any subgroup, this subgroup is dropped.


Equal Sample Sizes


Assume $n_i = n$. The process for a p chart is $\{p_i : i = 1, \dots, K\}$, where $p_i = x_i/n$. The center
line for a p chart is

$$\bar{p} = \frac{1}{K}\sum_{i=1}^{K}p_i$$
and the control limits are

$$LCL = \max\left(\bar{p} - A\sqrt{\bar{p}(1 - \bar{p})/n},\ 0\right)$$
$$UCL = \min\left(\bar{p} + A\sqrt{\bar{p}(1 - \bar{p})/n},\ 1\right)$$

The process for an np chart is $\{x_i : i = 1, \dots, K\}$. The center line for an np chart is

$$\bar{x} = \frac{1}{K}\sum_{i=1}^{K}x_i$$
and the control limits are

$$LCL = \max\left(\bar{x} - A\sqrt{n\bar{p}(1 - \bar{p})},\ 0\right)$$
$$UCL = \min\left(\bar{x} + A\sqrt{n\bar{p}(1 - \bar{p})},\ n\right)$$


Unequal Sample Sizes


The process for a p chart is $\{p_i : i = 1, \dots, K\}$, where $p_i = x_i/n_i$. The center line for a p chart is

$$\bar{p} = \sum_{i=1}^{K}x_i \Big/ \sum_{i=1}^{K}n_i$$

and the control limits for subgroup i are

$$LCL = \max\left(\bar{p} - A\sqrt{\bar{p}(1 - \bar{p})/n_i},\ 0\right)$$
$$UCL = \min\left(\bar{p} + A\sqrt{\bar{p}(1 - \bar{p})/n_i},\ 1\right)$$

The process for an np chart is $\{x_i : i = 1, \dots, K\}$. However, the center line for an np chart for
subgroup i is $n_i\bar{p}$. The control limits for subgroup i are

$$LCL = \max\left(n_i\bar{p} - A\sqrt{n_i\bar{p}(1 - \bar{p})},\ 0\right)$$
$$UCL = \min\left(n_i\bar{p} + A\sqrt{n_i\bar{p}(1 - \bar{p})},\ n_i\right)$$

Note: A warning message is issued when an np chart is requested for subgroups of unequal
sample sizes.
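The p chart limits for subgroups of possibly unequal size can be sketched as follows. The function name and the input layout (one count and one size per subgroup) are assumptions for illustration; with equal sizes the pooled proportion used here equals the mean of the subgroup proportions.

```python
import math

def p_chart_limits(counts, sizes, A=3.0):
    """Return p-bar and a (LCL, UCL) pair per subgroup.

    counts: non-conforming counts x_i; sizes: subgroup sizes n_i.
    """
    p_bar = sum(counts) / sum(sizes)
    limits = []
    for n_i in sizes:
        half = A * math.sqrt(p_bar * (1.0 - p_bar) / n_i)
        # Limits are clipped to the valid range of a proportion.
        limits.append((max(p_bar - half, 0.0), min(p_bar + half, 1.0)))
    return p_bar, limits

p_bar, limits = p_chart_limits([5, 10], [100, 100])
```

In this example the unclipped lower limit is slightly negative, so the reported LCL is 0.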


u and c Charts 


Measurements $x_{ij}$ show the number of defects for the jth unit for subgroup i. Hence,

$$x_i = \sum_{j=1}^{n_i}x_{ij}$$

is the total number of defects for subgroup i. When a weight variable is used, $x_i$ is a weighted
sum of defects.


Equal Sample Size


Assume $n_i = n$. The process for a u chart is $\{u_i : i = 1, \dots, K\}$, where $u_i = x_i/n$. The center
line for a u chart is

$$\bar{u} = \frac{1}{K}\sum_{i=1}^{K}u_i$$
and the control limits are

$$LCL = \max\left(\bar{u} - A\sqrt{\bar{u}/n},\ 0\right)$$
$$UCL = \bar{u} + A\sqrt{\bar{u}/n}$$

The process for a c chart is $\{x_i : i = 1, \dots, K\}$. The center line for a c chart is

$$\bar{c} = \frac{1}{K}\sum_{i=1}^{K}x_i$$

and the control limits for a c chart are

$$LCL = \max\left(\bar{c} - A\sqrt{\bar{c}},\ 0\right)$$
$$UCL = \bar{c} + A\sqrt{\bar{c}}$$


Unequal Sample Size


The process for a u chart is $\{u_i : i = 1, \dots, K\}$, where $u_i = x_i/n_i$. The center line for a u chart is

$$\bar{u} = \sum_{i=1}^{K}x_i \Big/ \sum_{i=1}^{K}n_i$$

and the control limits are

$$LCL = \max\left(\bar{u} - A\sqrt{\bar{u}/n_i},\ 0\right)$$
$$UCL = \bar{u} + A\sqrt{\bar{u}/n_i}$$

The process for a c chart is $\{x_i : i = 1, \dots, K\}$. The center line for subgroup i is $n_i\bar{u}$ and the
control limits are

$$LCL = \max\left(n_i\bar{u} - A\sqrt{n_i\bar{u}},\ 0\right)$$
$$UCL = n_i\bar{u} + A\sqrt{n_i\bar{u}}$$

Note: A warning message is issued when a c chart is requested for subgroups of unequal sample
sizes.
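A corresponding sketch for the u chart is given below, assuming that the center line for unequal sizes is total defects divided by total units. The function name is hypothetical. Unlike the p chart, only the lower limit is clipped, because a defect rate has no upper bound.

```python
import math

def u_chart_limits(defects, sizes, A=3.0):
    """Return u-bar and a (LCL, UCL) pair per subgroup.

    defects: total defects x_i per subgroup; sizes: subgroup sizes n_i.
    """
    u_bar = sum(defects) / sum(sizes)
    limits = []
    for n_i in sizes:
        half = A * math.sqrt(u_bar / n_i)
        limits.append((max(u_bar - half, 0.0), u_bar + half))
    return u_bar, limits

u_bar, u_limits = u_chart_limits([4, 6], [2, 3])
```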


Statistics 


This section discusses the capability and performance statistics that can be requested through 
SPCHART, and uses the following notation. 


Table 95-2
Notation
Notation        Description
$\bar{x}$       The total sample mean.
s               The total sample/process standard deviation.
$\hat{\sigma}$  The estimated sigma in the Process Capability Indices.
$\mu_0$         The nominal or the target value, given by the user.
LSL             The lower specification limit, given by the user.
USL             The upper specification limit, given by the user.

Assumptions

■ The process is in control. ($\hat{\sigma}$ and s are finitely estimated.)

■ The measured variable is normally distributed.

Prerequisites

■ For the Process Capability Indices except CpK and the Process Performance Indices except
PpK, both LSL and USL must be specified by the user, satisfying LSL < USL. For CpK and
PpK, at least one of LSL and USL must be specified by the user.

■ A target value $\mu_0$ such that LSL < $\mu_0$ < USL must be given by the user for CpM and PpM to
be computed.


Process Capability Indices 
The estimated capability sigma $\hat{\sigma}$ may be computed in one of four ways.

(1) If it is to be based on the sample within-subgroup variance,

$$\hat{\sigma} = \sqrt{\sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij} - \bar{x}_i)^2 \Big/ \sum_{i=1}^{k}(n_i - 1)}$$

(2) If it is to be based on the mean range,

$$\hat{\sigma} = \frac{1}{k}\sum_{i=1}^{k}\frac{R_i}{d_2(n_i)}$$

where $d_2(n_i) = \int_{-\infty}^{\infty}\left(1 - (1 - \Phi(x))^{n_i} - (\Phi(x))^{n_i}\right)dx$ with $\Phi(x) = \int_{-\infty}^{x}\frac{1}{\sqrt{2\pi}}e^{-u^2/2}\,du$.

Note that $n_i$ may or may not be equal for different subgroups. If they are all equal, we may write

$$\hat{\sigma} = \frac{\bar{R}}{d_2(n)}$$

where $\bar{R} = \sum_{i=1}^{k}R_i/k$, the mean range.

(3) If it is to be based on the mean standard deviation,

$$\hat{\sigma} = \frac{1}{k}\sum_{i=1}^{k}\frac{S_i}{c_4(n_i)}$$

where $c_4(n_i) = \sqrt{\frac{2}{n_i - 1}}\,\frac{\Gamma(n_i/2)}{\Gamma((n_i - 1)/2)}$ with the complete Gamma function Γ(.).

Note that $n_i$ may or may not be equal for different subgroups. If they are all equal, we may write

$$\hat{\sigma} = \frac{\bar{S}}{c_4(n)}$$

where $\bar{S} = \sum_{i=1}^{k}S_i/k$, the mean standard deviation.

(4) If it is to be based on the mean moving range,

$$\hat{\sigma} = \frac{\sum_{i=m}^{n}MR_i/(n - m + 1)}{d_2(m)} = \frac{\overline{MR}}{d_2(m)}$$

where

$$MR_i = \begin{cases} \max(y_{i-m+1}, \dots, y_i) - \min(y_{i-m+1}, \dots, y_i), & \text{if } i = m, \dots, n \\ \text{SYSMIS}, & \text{if } i = 1, \dots, m-1 \end{cases}$$

and n is the total sample size, m is the user-given length of span, and $MR_i$ is the ith moving
range for the data.

All of the capability indices, except K, require $\hat{\sigma}$, and in order to define them, we must have $\hat{\sigma} > 0$.


CP: Capability of the process
$$CP = \frac{USL - LSL}{6\hat{\sigma}}$$

CpL: The distance between the process mean and the lower specification limit
scaled by capability sigma
$$CpL = \frac{\bar{x} - LSL}{3\hat{\sigma}}$$

CpU: The distance between the process mean and the upper specification limit
scaled by capability sigma
$$CpU = \frac{USL - \bar{x}}{3\hat{\sigma}}$$

K: The deviation of the process mean from the midpoint of the specification limits
$$K = \frac{2\,|(USL + LSL)/2 - \bar{x}|}{USL - LSL}$$

Note this is computed independently of the estimated capability sigma, so the sigma does not need to
be greater than 0 or even specified.

CpK: Capability of process related to both dispersion and centeredness
$$CpK = \min(CpU, CpL)$$

If only one specification limit is provided, we compute and report a unilateral CpK instead of
taking the minimum.

CR: The reciprocal of CP
$$CR = \frac{1}{CP}$$

CpM: An index relating capability sigma and the difference between the process
mean and the target value
$$CpM = \frac{USL - LSL}{6\sqrt{\hat{\sigma}^2 + (\bar{x} - \mu_0)^2}}$$

$\mu_0$ must be given by the user.

Z-lower (Cap): The number of capability sigmas between the process mean and the
lower specification limit
$$CZ_L = \frac{\bar{x} - LSL}{\hat{\sigma}}$$

Z-upper (Cap): The number of capability sigmas between the process mean and the
upper specification limit
$$CZ_U = \frac{USL - \bar{x}}{\hat{\sigma}}$$

Z-min (Cap): The minimum number of capability sigmas between the process mean
and the specification limits
$$CZ_{min} = \min(CZ_U, CZ_L)$$

Note that unlike CpK, this index is undefined unless both specification limits are given and valid.

Z-max (Cap): The maximum number of capability sigmas between the process mean
and the specification limits
$$CZ_{max} = \max(CZ_U, CZ_L)$$

Note that unlike CpK, this index is undefined unless both specification limits are given and valid.

The estimated percentage outside the specification limits (Cap)
$$(1 - \Phi(CZ_U) + \Phi(-CZ_L)) \times 100\%$$

where Φ(.) is the cumulative distribution function of the standard normal distribution.
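The capability indices can be collected in one small function. The sketch below is an illustration only: it assumes both specification limits are given, uses the conventional 6-sigma and 3-sigma denominators, and takes the mean and the capability sigma as already estimated. The function name and the dictionary keys are hypothetical.

```python
import math
from statistics import NormalDist

def capability_indices(mean, sigma, lsl, usl, target=None):
    """Process capability indices for a given mean and capability sigma."""
    if sigma <= 0 or not lsl < usl:
        raise ValueError("need sigma > 0 and LSL < USL")
    phi = NormalDist().cdf
    cpl = (mean - lsl) / (3 * sigma)
    cpu = (usl - mean) / (3 * sigma)
    cz_l = (mean - lsl) / sigma
    cz_u = (usl - mean) / sigma
    out = {
        "CP": (usl - lsl) / (6 * sigma),
        "CpL": cpl,
        "CpU": cpu,
        "CpK": min(cpu, cpl),
        "K": 2 * abs((usl + lsl) / 2 - mean) / (usl - lsl),
        "pct_outside": (1 - phi(cz_u) + phi(-cz_l)) * 100,
    }
    out["CR"] = 1 / out["CP"]
    if target is not None:
        # CpM is only defined when a target value is supplied.
        out["CpM"] = (usl - lsl) / (6 * math.sqrt(sigma ** 2 + (mean - target) ** 2))
    return out

idx = capability_indices(mean=10.0, sigma=1.0, lsl=7.0, usl=13.0, target=10.0)
```

For a centered process whose limits sit three sigmas from the mean, CP, CpK, and CpM all equal 1 and roughly 0.27% of output is expected outside the limits.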


Process Performance Indices 


The estimated performance sigma is always the process standard deviation s. None of the indices
in this chapter is defined unless s > 0.

PP: Performance of the process
$$PP = \frac{USL - LSL}{6s}$$

PpL: The distance between the process mean and the lower specification limit
scaled by process standard deviation
$$PpL = \frac{\bar{x} - LSL}{3s}$$

PpU: The distance between the process mean and the upper specification limit
scaled by process standard deviation
$$PpU = \frac{USL - \bar{x}}{3s}$$

PpK: Performance of process related to both dispersion and centeredness
$$PpK = \min(PpU, PpL)$$
If only one specification limit is provided, we compute and report a unilateral PpK instead of
taking the minimum.

PR: The reciprocal of PP
$$PR = \frac{1}{PP}$$

PpM: An index relating process variance and the difference between the
process mean and the target value
$$PpM = \frac{USL - LSL}{6\sqrt{s^2 + (\bar{x} - \mu_0)^2}}$$

$\mu_0$ must be given by the user.

Z-lower (Perf): The number of standard deviations between the process mean and
the lower specification limit
$$PZ_L = \frac{\bar{x} - LSL}{s}$$

Z-upper (Perf): The number of standard deviations between the process mean and
the upper specification limit
$$PZ_U = \frac{USL - \bar{x}}{s}$$

Z-min (Perf): The minimum number of standard deviations between the process
mean and the specification limits
$$PZ_{min} = \min(PZ_U, PZ_L)$$
Note that unlike PpK, this index is undefined unless both specification limits are given and valid.

Z-max (Perf): The maximum number of standard deviations between the process
mean and the specification limits
$$PZ_{max} = \max(PZ_U, PZ_L)$$
Note that unlike PpK, this index is undefined unless both specification limits are given and valid.

The estimated percentage outside the specification limits (Perf)
$$(1 - \Phi(PZ_U) + \Phi(-PZ_L)) \times 100\%$$

where Φ(.) is the cumulative distribution function of the standard normal distribution.

Measure(s) for Assessing Normality: The observed percentage outside the
specification limits

This is the percentage of individual observations in the process that lie outside the specification
limits. A point is defined as outside the specification limits when its value is greater than the
USL or less than the LSL.
SPECTRA Algorithms


SPECTRA plots the periodogram and spectral density function estimates for one or more series. 


Univariate Series 


For all t, the series $X_t$ can be represented by

$$X_t = a_0 + \sum_{k=1}^{q}\left(a_k\cos 2\pi f_k(t-1) + b_k\sin 2\pi f_k(t-1)\right)$$
where

$$a_0 = \bar{X}$$
$$a_k = \frac{2}{N}\sum_{t=1}^{N}X_t\cos 2\pi f_k(t-1)$$
$$b_k = \frac{2}{N}\sum_{t=1}^{N}X_t\sin 2\pi f_k(t-1)$$
$$f_k = k/N$$
$$q = \begin{cases} N/2, & \text{if } N \text{ is even} \\ (N-1)/2, & \text{if } N \text{ is odd} \end{cases}$$

The following statistics are calculated:

Frequency
$$f_k = k/N, \quad k = 1, \dots, q$$

Period
$$1/f_k, \quad k = 1, \dots, q$$

Fourier Cosine Coefficient
$$a_k, \quad k = 1, \dots, q$$

Fourier Sine Coefficient
$$b_k, \quad k = 1, \dots, q$$

Periodogram
$$I_k = [(a_k)^2 + (b_k)^2]\,N/2, \quad k = 1, \dots, q$$

Spectral density estimate
$$s_k = \sum_{j=-p}^{p}w_jI_{k+j}, \quad \text{where } 2p + 1 = m \text{ (number of spans)}$$
and

$$I_{-k} = I_k, \quad k = 1, \dots, q$$
$$I_0 = I_1$$
$$I_k = I_{N-k} \quad \text{for } k > q$$

$w_{-p}, w_{-p+1}, \dots, w_0, w_1, \dots, w_p$ are the periodogram weights defined by different data windows.
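The Fourier coefficients and the periodogram can be computed directly from their definitions. The following is a minimal sketch, assuming coefficients of the form (2/N) times the cosine or sine sum and a periodogram of (a_k² + b_k²)N/2; it uses a direct O(N²) sum rather than a fast Fourier transform, and the function name is hypothetical.

```python
import math

def periodogram(x):
    """Return rows (f_k, a_k, b_k, I_k) for k = 1, ..., q."""
    N = len(x)
    q = N // 2  # N/2 when N is even, (N-1)/2 when N is odd
    rows = []
    for k in range(1, q + 1):
        f_k = k / N
        # enumerate() yields t - 1 = 0, ..., N - 1, matching the (t - 1) terms.
        a_k = 2 / N * sum(v * math.cos(2 * math.pi * f_k * t) for t, v in enumerate(x))
        b_k = 2 / N * sum(v * math.sin(2 * math.pi * f_k * t) for t, v in enumerate(x))
        rows.append((f_k, a_k, b_k, (a_k ** 2 + b_k ** 2) * N / 2))
    return rows

# A pure cosine with frequency 2/8 observed at N = 8 time points.
series = [math.cos(2 * math.pi * (2 / 8) * t) for t in range(8)]
rows = periodogram(series)
```

For this series all of the variation is attributed to the frequency 2/8, so only the second periodogram ordinate is non-zero.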


Bivariate Series 


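For the bivariate series $X_t$ and $Y_t$

$$X_t = a_0^x + \sum_{k=1}^{q}\left(a_k^x\cos 2\pi f_kt + b_k^x\sin 2\pi f_kt\right), \quad t = 1, \dots, N$$
$$Y_t = a_0^y + \sum_{k=1}^{q}\left(a_k^y\cos 2\pi f_kt + b_k^y\sin 2\pi f_kt\right)$$

Cross-Periodogram of X and Y

$$I_k^{xy} = \frac{N}{2}(a_k^x - ib_k^x)(a_k^y + ib_k^y) = \frac{N}{2}\left[(a_k^xa_k^y + b_k^xb_k^y) + i(a_k^xb_k^y - b_k^xa_k^y)\right]$$

Real
$$(RC)_k = \frac{N}{2}(a_k^xa_k^y + b_k^xb_k^y)$$

Imaginary
$$(IC)_k = \frac{N}{2}(a_k^xb_k^y - b_k^xa_k^y)$$

Cospectral Density Estimate
$$C_k = \sum_{j=-p}^{p}w_j(RC)_{k+j}$$

Quadrature Spectrum Estimate
$$Q_k = \sum_{j=-p}^{p}w_j(IC)_{k+j}$$

Cross-amplitude Values
$$A_k = (Q_k^2 + C_k^2)^{1/2}$$

Squared Coherency Values
$$K_k = \frac{A_k^2}{s_k^xs_k^y}$$

Gain Values
$$G_k = \begin{cases} A_k/s_k^x & \text{(gain of } Y_t \text{ over } X_t \text{ at } f_k) \\ A_k/s_k^y & \text{(gain of } X_t \text{ over } Y_t \text{ at } f_k) \end{cases}$$

Phase Spectrum Estimate
$$\psi_k = \begin{cases} \tan^{-1}(Q_k/C_k) & \text{if } Q_k > 0, C_k > 0 \text{ or } Q_k < 0, C_k > 0 \\ \tan^{-1}(Q_k/C_k) + \pi & \text{if } Q_k > 0, C_k < 0 \\ \tan^{-1}(Q_k/C_k) - \pi & \text{if } Q_k < 0, C_k < 0 \end{cases}$$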
Data Windows


The following spectral windows can be specified. Each formula defines the upper half of the
window. The lower half is symmetric with the upper half. In all formulas, p is the integer part
of the number of spans divided by 2. To be concise, the formulas are expressed in terms of
the Fejer kernel:

$$F_q(\theta) = \begin{cases} q, & \theta = 0, \pm 2\pi, \pm 4\pi, \dots \\ \frac{1}{q}\left(\frac{\sin(q\theta/2)}{\sin(\theta/2)}\right)^2, & \text{otherwise} \end{cases}$$

and the Dirichlet kernel:

$$D_q(\theta) = \begin{cases} 2q + 1, & \theta = 0, \pm 2\pi, \pm 4\pi, \dots \\ \frac{\sin((2q+1)\theta/2)}{\sin(\theta/2)}, & \text{otherwise} \end{cases}$$

where q is any positive real number.


HAMMING

Tukey-Hamming window. The weights are

$$W_k = 0.54D_p(2\pi f_k) + 0.23D_p\left(2\pi f_k + \frac{\pi}{p}\right) + 0.23D_p\left(2\pi f_k - \frac{\pi}{p}\right)$$

for k = 0, ..., p.


TUKEY

Tukey-Hanning window. The weights are

$$W_k = 0.5D_p(2\pi f_k) + 0.25D_p\left(2\pi f_k + \frac{\pi}{p}\right) + 0.25D_p\left(2\pi f_k - \frac{\pi}{p}\right)$$

for k = 0, ..., p.


PARZEN

Parzen window. The weights are

$$W_k = \frac{1}{p}\left(2 + \cos(2\pi f_k)\right)\left(F_{p/2}(2\pi f_k)\right)^2$$

for k = 0, ..., p.
BARTLETT

Bartlett window. The weights are

$$W_k = F_p(2\pi f_k)$$

for k = 0, ..., p.


DANIELL (UNIT)

Daniell window or rectangular window. The weights are

$$W_k = 1$$

for k = 0, ..., p.


NONE

No smoothing. If NONE is specified, the spectral density estimate is the same as the periodogram.
This is also the case when the number of spans is 1.


$W_{-p}, \dots, W_0, \dots, W_p$

User-specified weights. If the number of weights is odd, the middle weight is applied to the
periodogram value being smoothed and the weights on either side are applied to preceding and
following values. If the number of weights is even (it is assumed that $W_p$ is not supplied), the
weight after the middle applies to the periodogram value being smoothed. The
weight $W_0$ must be positive.
SURVIVAL Algorithms


Although life table analysis may be useful in many differing situations and disciplines, for 
simplicity, the usual survival-time-to-death terminology will be used here. 


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


$X_j$    Time from starting event to terminal event or censoring for case j
$w_j$    Weight for case j
k        Total number of intervals
$t_i$    Beginning time for ith interval
$h_i$    Width of interval i
$c_i$    Sum of weights of cases censored in interval i
$d_i$    Sum of weights of cases experiencing the terminal event in interval i


Construction of Life Table (Gehan, 1975) 


The following sections detail the construction of the life table. 


Computation of Intervals 


The widths of the intervals for the actuarial calculations must be defined by the user. In addition to 
the last interval specified, an additional interval is automatically created to take care of any times 
exceeding the last. If the upper limits are not in ascending order, a message is printed and the 
procedure stops. If the interval width does not divide the time range into an integral number of 
intervals, a warning is printed and the interval width is reset so that the number of intervals will be 
the nearest integer to that resulting from the user specification. 


Count of Events and Censoring 


For each case, the interval i into which the survival time falls is determined. 


If $X_j$ exceeds $t_k$, the starting time for the last interval, it is included in the last interval. The status
code is examined to determine whether the observed time is time to event or time to censoring.

If it is time to censoring, that is, the terminal event did not occur, $c_i$ is incremented by the case
weight. If it is time to event, $d_i$ is incremented by the case weight.
Calculation of Survival Functions


For each interval, the following are calculated.

Number Alive at the Beginning
$$l_i = l_{i-1} - c_{i-1} - d_{i-1}$$
where $l_1$ is the sum of weights of all cases in the table.

Number Exposed to Risk of an Event
$$r_i = l_i - c_i/2$$

Proportion Terminating
$$q_i = d_i/r_i$$

Proportion Surviving
$$p_i = 1 - q_i$$

Cumulative Proportion Surviving at End of Interval
$$P_i = P_{i-1}p_i$$
where
$$P_0 = 1$$

Probability Density Function
$$f_i = \frac{P_{i-1} - P_i}{h_i}$$

Hazard Rate
$$\lambda_i = \frac{2q_i}{h_i(1 + p_i)}$$

Standard Error of Probability Surviving
$$se(P_i) = P_i\sqrt{\sum_{j=1}^{i}q_j/(r_jp_j)}$$

Standard Error of Probability Density
$$se(f_i) = \frac{P_{i-1}q_i}{h_i}\sqrt{\sum_{j=1}^{i-1}q_j/(r_jp_j) + p_i/(r_iq_i)}$$

For the first interval
$$se(f_1) = \frac{q_1}{h_1}\sqrt{\frac{p_1}{r_1q_1}}$$

Standard Error of the Hazard Rate
$$se(\lambda_i) = \sqrt{\frac{\lambda_i^2}{r_iq_i}\left\{1 - (\lambda_ih_i/2)^2\right\}}$$

If $q_i = 0$, the standard error for interval i is set to 0.


Median Survival Time

If $P_k \ge 0.5$, the value printed for median survival time is
$$t_k+$$
Otherwise, let i be the interval for which $P_i < 0.5$ and $P_{i-1} \ge 0.5$. The estimate of the median
survival time is then
$$Md = t_i + \frac{h_{i-1}(P_{i-1} - 0.5)}{P_{i-1} - P_i}$$
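The interval-by-interval calculations can be sketched as a single pass over the intervals. This is a minimal illustration, assuming every interval has at least one case at risk and a surviving proportion greater than zero (otherwise a division by zero would occur); the function name and the dictionary keys are hypothetical.

```python
import math

def life_table(l1, widths, censored, events):
    """Actuarial life table rows from interval widths h_i, c_i, and d_i."""
    rows = []
    alive, P_prev, cum = l1, 1.0, 0.0
    for h, c, d in zip(widths, censored, events):
        r = alive - c / 2          # number exposed to risk
        q = d / r                  # proportion terminating
        p = 1 - q                  # proportion surviving
        P = P_prev * p             # cumulative proportion surviving
        cum += q / (r * p)         # running sum used by se(P_i)
        rows.append({
            "alive": alive, "exposed": r, "q": q, "p": p, "P": P,
            "density": (P_prev - P) / h,
            "hazard": 2 * q / (h * (1 + p)),
            "se_P": P * math.sqrt(cum),
        })
        alive -= c + d             # number alive at the start of the next interval
        P_prev = P
    return rows

table = life_table(l1=10, widths=[1, 1], censored=[0, 2], events=[2, 2])
```

In the second interval, two censored cases reduce the eight cases alive to seven exposed to risk, which is the half-interval adjustment of the actuarial method.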


Comparison of Survival Distributions 


The survival times from the groups to be compared are jointly sorted into ascending order. If 
survival times are equal, the uncensored is taken to be less than the censored. When approximate 
comparisons are done, they are based on the lifetables, with the beginning of the interval 
determining the length of survival for cases censored or experiencing the event in that interval. 


Notation 


The following notation is used throughout this section unless otherwise stated: 


N        Number of cases
$X_k$    Survival time for case k, where times are sorted into ascending order so that
         case 1 has the shortest time and case N the longest
$w_k$    Weight for case k
g        Number of nonempty groups in the comparison
$W_j$    Sum of weights of cases in group j
$W_c$    Sum of weights of censored cases
$W_u$    Sum of weights of uncensored cases
W        Sum of weights of all cases
Computations 


For each case the following are computed:

■ $ULE_k$: Sum of weights of uncensored cases with survival times less than or equal to that
of case k.
■ $CLE_k$: Same as above, but for censored cases.
■ $UE_k$: Sum of weights of uncensored cases with survival times equal to that of case k.
■ $CE_k$: Same as above, but for censored cases.

The score for case k is:

$$S_k = \begin{cases} ULE_k & \text{if } X_k \text{ is censored} \\ A_1 - A_2 - A_3 & \text{if } X_k \text{ is uncensored} \end{cases}$$

where

$A_1 = ULE_k - UE_k$          uncensored cases surviving shorter than case k
$A_2 = W_c - CLE_k + CE_k$    censored cases surviving longer than or equal to case k
$A_3 = W_u - ULE_k$           uncensored cases surviving longer than case k


Test Statistic and Significance (Wilcoxon (Gehan)) 

The test statistic is

$$D = \frac{(W - 1)B}{T}$$
where

$$B = \sum_{j=1}^{g}SS_j^2/W_j$$

$SS_j$ = the sum of scores of cases in group j

$$T = \sum_{i=1}^{N}S_i^2$$

Under the hypothesis that the groups are samples from the same survival distribution, D is
asymptotically distributed as a chi-square with (g − 1) degrees of freedom.
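The scoring rule and the test statistic can be sketched for the unweighted case. The code below assumes every case has weight 1, so that the sums of weights reduce to counts, and it takes T as the sum of squared scores; the function names and the group labels are hypothetical.

```python
def gehan_scores(times, censored):
    """Score S_k for each case, with unit case weights."""
    n = len(times)
    w_u = sum(1 for c in censored if not c)  # number of uncensored cases
    w_c = n - w_u                            # number of censored cases
    pairs = list(zip(times, censored))
    scores = []
    for k in range(n):
        ule = sum(1 for t, c in pairs if not c and t <= times[k])
        ue = sum(1 for t, c in pairs if not c and t == times[k])
        cle = sum(1 for t, c in pairs if c and t <= times[k])
        ce = sum(1 for t, c in pairs if c and t == times[k])
        if censored[k]:
            scores.append(ule)
        else:
            scores.append((ule - ue) - (w_c - cle + ce) - (w_u - ule))
    return scores

def gehan_statistic(scores, groups):
    """D = (W - 1) * B / T with unit case weights."""
    group_sum, group_size = {}, {}
    for s, g in zip(scores, groups):
        group_sum[g] = group_sum.get(g, 0) + s
        group_size[g] = group_size.get(g, 0) + 1
    B = sum(group_sum[g] ** 2 / group_size[g] for g in group_sum)
    T = sum(s * s for s in scores)
    return (len(scores) - 1) * B / T

scores = gehan_scores([1, 2, 3], [False, True, False])
D = gehan_statistic(scores, ["A", "A", "B"])
```

The scores sum to zero across all cases, which is why the denominator must be built from squared scores.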
T Test Algorithms


The T Test procedure compares the means of two groups or, in the one-sample case, compares the mean
of a group with a constant.


Notation 

The following notation is used throughout this chapter unless otherwise stated: 
Table 98-1
Notation
Notation     Description
$X_{ki}$     Value for ith case of group k
$w_{ki}$     Weight for ith case of group k
$N_k$        Number of cases in group k
$W_k$        Sum of weights of cases in group k


Basic Statistics 


The following statistics are computed. 


Means 
Nk 
S- XpiWhi 
Xp = = a, k=1,2 
Variances 
Nk Nk 2 
vam ~ (Se ion] W, 
G2 — t=1 i=l 
YE (W.—-l 


Standard Errors of the Mean 


SEM; = Si. / JW 
Differences of the Means for Groups 1 and 2 
D = Xy 7 xX 
Unpooled (Separate Variance) Standard Error of the Difference 


= | g2 S2 
: =—= {= a2 
Sp \V W, | Ws 
The 95% confidence interval for mean difference is

$$D \pm t_{df} S_D$$

where $t_{df}$ is the upper 2.5% critical value for the t distribution with df degrees of freedom.


Pooled Standard Error of the Difference 


$$S'_D = S_p \sqrt{\frac{1}{W_1} + \frac{1}{W_2}}$$

where the pooled estimate of the variance is

$$S_p^2 = \frac{(W_1 - 1) S_1^2 + (W_2 - 1) S_2^2}{W_1 + W_2 - 2}$$

The 95% confidence interval for mean difference is

$$D \pm t_{df} S'_D$$

where df is defined in the following.


The t Statistics for Equality of Means 


Separate Variance 


Pooled Variance 


$$t' = D / S'_D$$

$$df = W_1 + W_2 - 2$$

The two-tailed significance levels are obtained from the t distribution separately for each of the computed t values.
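The weighted two-sample quantities above can be sketched in a few lines of Python. This is a minimal illustration, not the product's implementation; the function names are hypothetical. The separate-variance statistic is taken to be D divided by the unpooled standard error, by analogy with the pooled case, which is an assumption because that formula is not shown above. Its degrees of freedom are not computed here.

```python
import math


def weighted_group_stats(x, w):
    """Return (weighted mean, weighted variance, sum of weights) for one group."""
    W = sum(w)
    sx = sum(wi * xi for xi, wi in zip(x, w))
    sxx = sum(wi * xi * xi for xi, wi in zip(x, w))
    mean = sx / W
    var = (sxx - sx * sx / W) / (W - 1)
    return mean, var, W


def two_sample_t(x1, w1, x2, w2):
    """Difference of means, both standard errors, and the two t values."""
    m1, v1, W1 = weighted_group_stats(x1, w1)
    m2, v2, W2 = weighted_group_stats(x2, w2)
    D = m1 - m2
    se_unpooled = math.sqrt(v1 / W1 + v2 / W2)
    pooled_var = ((W1 - 1) * v1 + (W2 - 1) * v2) / (W1 + W2 - 2)
    se_pooled = math.sqrt(pooled_var) * math.sqrt(1 / W1 + 1 / W2)
    return {
        "D": D,
        "t_separate": D / se_unpooled,  # assumed form: D / S_D
        "t_pooled": D / se_pooled,
        "df_pooled": W1 + W2 - 2,
    }
```

With unit weights the sums of weights reduce to the case counts, so the result matches the familiar unweighted t test.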


The Test for Equality of Variances 


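The Levene statistic is used and defined as

$$L = \frac{(W - 2) \sum_{k=1}^{2} W_k \left( \bar{Z}_k - \bar{Z} \right)^2}{\sum_{k=1}^{2} \sum_{i=1}^{N_k} w_{ki} \left( Z_{ki} - \bar{Z}_k \right)^2}$$

where

$$Z_{ki} = \left| X_{ki} - \bar{X}_k \right|$$

$$\bar{Z}_k = \frac{\sum_{i=1}^{N_k} w_{ki} Z_{ki}}{W_k}$$

$$\bar{Z} = \frac{\sum_{k=1}^{2} W_k \bar{Z}_k}{W}$$

and $W = W_1 + W_2$.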


The t Test for Paired Samples 


The following notation is used throughout this section unless otherwise stated:

Table 98-2
Notation

Notation    Description
$X_i$       Value of variable X for case i
$Y_i$       Value of variable Y for case i
$w_i$       Weight for case i
$W$         Sum of the weights
$N$         Number of cases

Means

$$\bar{X} = \sum_{i=1}^{N} w_i X_i / W$$

$$\bar{Y} = \sum_{i=1}^{N} w_i Y_i / W$$

Variances

$$S_X^2 = \frac{\sum_{i=1}^{N} w_i X_i^2 - \left( \sum_{i=1}^{N} w_i X_i \right)^2 / W}{W - 1}$$

Similarly for $S_Y^2$.


Covariance between X and Y 


$$S_{XY} = \frac{1}{W - 1} \left( \sum_{k=1}^{N} X_k Y_k w_k - \left( \sum_{k=1}^{N} w_k X_k \right) \left( \sum_{k=1}^{N} w_k Y_k \right) / W \right)$$


Difference of the Means 


$$D = \bar{X} - \bar{Y}$$

Standard Error of the Difference

$$S_D = \sqrt{\left( S_X^2 + S_Y^2 - 2 S_{XY} \right) / W}$$


t statistic for Equality of Means 


$$t = D / S_D$$

with (W - 1) degrees of freedom. A two-tailed significance level is printed.

95% Confidence Interval for Mean Difference

$$D \pm t_{W-1} S_D$$


Correlation Coefficient between X and Y 


$$r = \frac{S_{XY}}{S_X S_Y}$$

The two-tailed significance level is based on

$$t = r \sqrt{\frac{W - 2}{1 - r^2}}$$

with (W - 2) degrees of freedom.
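As a quick check on the paired-sample formulas, the sketch below computes the standard error of the difference from the two variances and the covariance, then the t value and the correlation. The function name is hypothetical and the code is only an illustration of the formulas in this section.

```python
import math


def paired_t(x, y, w):
    """Weighted paired-samples statistics: (D, S_D, t, df, r)."""
    W = sum(w)
    sx = sum(wi * xi for xi, wi in zip(x, w))
    sy = sum(wi * yi for yi, wi in zip(y, w))
    sxx = sum(wi * xi * xi for xi, wi in zip(x, w))
    syy = sum(wi * yi * yi for yi, wi in zip(y, w))
    sxy = sum(wi * xi * yi for xi, yi, wi in zip(x, y, w))
    var_x = (sxx - sx * sx / W) / (W - 1)
    var_y = (syy - sy * sy / W) / (W - 1)
    cov_xy = (sxy - sx * sy / W) / (W - 1)
    D = sx / W - sy / W
    se = math.sqrt((var_x + var_y - 2 * cov_xy) / W)
    r = cov_xy / math.sqrt(var_x * var_y)
    return D, se, D / se, W - 1, r
```

With unit weights, the standard error obtained this way equals the standard error of the mean of the case-wise differences, which is why the covariance term appears with a factor of 2.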


One-Sample t Test 


The following notation is used throughout this chapter unless otherwise stated:

Table 98-3
Notation

Notation    Description
$N$         Number of cases
$X_i$       Value of variable X for case i
$w_i$       Weight for case i
$v$         Test value

Mean

$$\bar{X} = \frac{1}{W} \sum_{i=1}^{N} w_i X_i$$

where $W = \sum_{i=1}^{N} w_i$ is the sum of the weights.


Variance 


$$S_X^2 = \frac{1}{W - 1} \sum_{i=1}^{N} w_i \left( X_i - \bar{X} \right)^2$$

Standard Deviation

$$S_X = \sqrt{S_X^2}$$

Standard Error of the Mean

$$S_{\bar{X}} = S_X / \sqrt{W}$$

Mean Difference

$$D = \bar{X} - v$$


The t value

$$t = D / S_{\bar{X}}$$

with (W - 1) degrees of freedom. A two-tailed significance level is printed.


100p% Confidence Interval for the Mean Difference 


$$CI = D \pm t_{W-1,(p+1)/2} S_{\bar{X}}$$

where $t_{W-1,(p+1)/2}$ is the 100((p + 1)/2)th percentile of a Student's t distribution with (W - 1) degrees of freedom.
1. Introduction


Forecasting and prediction are important tasks in real world applications that involve decision making. In such 
applications, it is important to go beyond discovering statistical correlations and unravel the key variables that 
influence the behaviors of other variables using an algebraic approach. Many real world data, such as stock price 
data, are temporal in nature; that is, the values of a set of variables depend on the values of another set of variables at 
several time points in the past. Temporal causal modeling, or TCM, refers to a suite of methods that attempt to 
discover key temporal relationships in time series data. This chapter describes a particular method to discover 
temporal relationships using a combination of Granger causality and regression algorithms for variable selection. 
Although this treatment strives to be self-contained, a minimal set of papers describing the design principles behind 
the method can be found in [Lozano et al., 2011, Lozano et al., 2009, Arnold et al., 2007]¹.


The rest of the chapter is organized as follows. Section 2 lays the groundwork for the TCM algorithm (notation and 
brief history) and explains the group orthogonal matching pursuit (GOMP) [Lozano et al., 2011] algorithm that is
used. Section 3 describes the techniques used to fit and forecast time series and compute approximated forecasting 
intervals. Section 4 describes scenario analysis, which refers to a capability of the TCM product to “play-out” the 
repercussions of artificially setting the value of a time series. Section 5 describes the detection of outliers, and 
Section 6 discusses how potential causes for outliers can be established using root cause analysis. 


2. Model 


Introduced by Clive Granger [Granger, 1980], Granger causality in time series is based on the intuition that a cause should necessarily precede its effect, and that if time series a causally affects time series b, then the past values of a should be useful in predicting the future values of b. More specifically, time series a is said to “Granger cause” time series b if the accuracy of regressing for b in terms of past values of both a and b is statistically significantly better than regressing just with past values of b. If the time series have T time points and are denoted by $\{a_t\}_{t=1}^{T}$ and $\{b_t\}_{t=1}^{T}$, then the following regressions are performed:

$$b_t \approx \sum_{j=1}^{L} \alpha_j a_{t-j} + \sum_{j=1}^{L} \beta_j b_{t-j} \qquad (1)$$

$$b_t \approx \sum_{j=1}^{L} \beta_j b_{t-j} \qquad (2)$$


Here L is the number of lags; that is, the value of b at time t can only be determined by values of other time series at 
times {t — 1,t — 2,...,t — L}. If Equation (1) is statistically more significant (using some test for significance) than 
Equation (2), then a is deemed to Granger cause b. 
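The comparison of Equations (1) and (2) can be sketched with ordinary least squares. The text leaves the significance test open; the block below assumes a standard F statistic for nested regressions, and the helper names are hypothetical. No intercept is fitted, mirroring the two equations as written.

```python
import numpy as np


def lagged(v, L):
    """Columns are v[t-1], ..., v[t-L] for every t that has L past values."""
    T = len(v)
    return np.column_stack([v[L - j:T - j] for j in range(1, L + 1)])


def granger_f_statistic(a, b, L):
    """F statistic comparing Equation (1) (lags of a and b) with Equation (2) (lags of b)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    y = b[L:]
    X_restricted = lagged(b, L)                       # Equation (2)
    X_full = np.hstack([lagged(a, L), X_restricted])  # Equation (1)

    def rss(X):
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ coef
        return float(resid @ resid)

    rss_restricted, rss_full = rss(X_restricted), rss(X_full)
    df_num = L
    df_den = len(y) - X_full.shape[1]
    return ((rss_restricted - rss_full) / df_num) / (rss_full / df_den)
```

A large value, judged against an F distribution with (L, T - 3L) degrees of freedom, suggests that past values of a improve the prediction of b.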


¹ The methods described in this chapter are particularly useful for under-determined systems, where the number of time series (n) far exceeds the number of samples (m); that is, n >> m. Although these methods function for both over-determined (m >> n) and fully-determined (n = m) systems, there are other approaches to pursue for such systems.
2.1 Graphical Granger Modeling


Granger causality is classically defined for a pair of time series. In the real world, we are interested in finding not one, but all the significant time series that influence the target time series. In order to accomplish this, we use group greedy ($\ell_0$) regression algorithms with variable selection (see Section 2.3). An important feature of our TCM algorithm is that it groups influencer/predictor variables; that is, we are interested in predicting whether time series a as a whole, $\{a_{t-1}, a_{t-2}, ..., a_{t-L}\}$, has influence over time series b. Such grouping is a more natural interpretation of causality and also helps sparsify the solution set. For example, without such grouping we may select the time-lagged series $a_{t-2}$ to model b, but not select any other value of a, which increases the number of choices for variable selection L-fold, where L is the number of lags that is allowed.


2.2 Notation 


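The following notation is used throughout this chapter unless otherwise stated:

Table 1: Notation

Notation    Description
$\mathbb{R}$    Set of real numbers
$\backslash$    Regression solve operator
$\| \cdot \|_2$    $\ell_2$ norm of a vector, i.e. $\| z \|_2 = \sqrt{\sum_i z_i^2}$
$m$    Number of time points
$L$    Number of lags for each target. $L < m$
$X \in \mathbb{R}^{m \times n}$    Design matrix of input series
$G: \mathbb{R}^{m \times n} \times J \times L \rightarrow \mathbb{R}^{(m-L) \times (|J| L)}$    Lag matrix
$J = \bigcup_i \{i\}$, $1 \leq i \leq n$    Set of column indices
$M: \mathbb{R}^{m \times k} \rightarrow \mathbb{R}^{k \times 1}$    Computes means for k series
$S: \mathbb{R}^{m \times k} \rightarrow \mathbb{R}^{k \times 1}$    Computes standard deviations for k series
$\epsilon \in \mathbb{R}$    Tolerance value for stopping criterion
$K^*$    Max number of predictors selected or maximum number of iterations
$K$    Actual number of predictors selected for a target
$\beta^*$    Coefficient vector for predictors on the transformed scale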


In this section, we introduce the algorithm that is used to construct the temporal causal model. The list of symbols used in the rest of this chapter is summarized in Table 1. Most of the symbols are self-explanatory; however, the function G, which stands for grouping, requires some additional explanation. G is a function that takes a matrix ($\mathbb{R}^{m \times n}$), a set of column indices J, and a lag value L and constructs a lag matrix that has (m - L) rows and (|J|L) columns. Basically, for every column index $j \in J$, G constructs an (m - L) x L lag matrix by carefully unrolling the j-th column of the input matrix. An example of G's action is shown below:
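$$G \left( X = \begin{bmatrix} a_1 & b_1 & c_1 & d_1 \\ a_2 & b_2 & c_2 & d_2 \\ a_3 & b_3 & c_3 & d_3 \\ a_4 & b_4 & c_4 & d_4 \\ a_5 & b_5 & c_5 & d_5 \end{bmatrix}, \; J = \{1\}, \; L = 2 \right) \rightarrow \begin{bmatrix} a_2 & a_1 \\ a_3 & a_2 \\ a_4 & a_3 \end{bmatrix}$$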

In this example, the input matrix $X \in \mathbb{R}^{5 \times 4}$ has four time series (n = 4) and five time points per time series (m = 5). The lag matrix associated with the time series in column 1, when L (lag) is 2, is produced by invoking G(X, {1}, 2). Note that the lag matrix consists of the lag-1 vector of the series as the first column, the lag-2 vector as the second column, and so on up to the lag-L vector as the L-th column. Similarly, the functions (M, S) accept any input matrix and compute the mean and the standard deviation, respectively, of the matrix's columns. For purposes of numerical stability, and to increase interpretability during modeling, columns of the lagged matrix are both centered by the column means and scaled by the column standard deviations². On the other hand, the target y is only centered. An example of mean centering and scaling for the lagged matrices is shown below:


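$$\begin{bmatrix} a_1 & b_1 \\ a_2 & b_2 \\ a_3 & b_3 \end{bmatrix} \rightarrow \begin{bmatrix} \frac{a_1 - \bar{a}}{\sigma_a} & \frac{b_1 - \bar{b}}{\sigma_b} \\ \frac{a_2 - \bar{a}}{\sigma_a} & \frac{b_2 - \bar{b}}{\sigma_b} \\ \frac{a_3 - \bar{a}}{\sigma_a} & \frac{b_3 - \bar{b}}{\sigma_b} \end{bmatrix}$$

Here, $(\bar{a}, \sigma_a)$ and $(\bar{b}, \sigma_b)$ are the means and standard deviations of the first and the second columns, respectively.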
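The grouping function G and the centering and scaling step can be sketched with NumPy as follows. The function names are hypothetical, column indices are 0-based in the code (the text uses 1-based indices), and the sample standard deviation (ddof=1) is an assumption because the text does not state which divisor is used.

```python
import numpy as np


def lag_matrix(X, J, L):
    """G(X, J, L): an (m-L) x (|J|*L) matrix.

    For each column index j in J (0-based here), the block holds the
    lag-1 vector first, then lag-2, up to lag-L.
    """
    X = np.asarray(X, dtype=float)
    m = X.shape[0]
    cols = [X[L - lag:m - lag, j] for j in J for lag in range(1, L + 1)]
    return np.column_stack(cols)


def standardize(Z):
    """Center each column by its mean and scale by its standard deviation."""
    Z = np.asarray(Z, dtype=float)
    return (Z - Z.mean(axis=0)) / Z.std(axis=0, ddof=1)
```

For a 5 x 4 input whose first column is (1, 2, 3, 4, 5), `lag_matrix(X, [0], 2)` returns the rows (2, 1), (3, 2), (4, 3), which matches the pattern of the example above.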


2.3 Group Orthogonal Matching Pursuit (GOMP) 
Algorithm 1: GOMP

Input: $X, y, G, M, S, L, \epsilon, K^*, J_{sel}^0, J_{for}$.
Output: $J_{sel}, \beta^*$.
1   $X_{aug}^0 = G(X, J_{sel}^0, L)$;
2   for $i \in [1, (m - L)]$ do $X_{aug}^0(i,:) = \frac{X_{aug}^0(i,:) - M(X_{aug}^0)^T}{S(X_{aug}^0)^T}$;
3   $\beta^{*0} = X_{aug}^0 \backslash (y - M(y))$;
4   $r^0 = y - M(y) - X_{aug}^0 \beta^{*0}$;
5   if any redundant series are found, delete them in $J_{sel}^0$;
6   if ($|J_{sel}^0| \geq K^*$), then $J_{sel}^0 = J_{sel}^0(1:K^*)$, update $\beta^{*0}$, return $J_{sel}^0$, $\beta^{*0}$ and stop;
7   otherwise update $\beta^{*0}$ and $r^0$;
8   for $k \in 1, 2, 3, ..., (K^* - |J_{sel}^0|)$ do
9       $j^k = argmin(X, r^{k-1}, G, M, S, L, \epsilon_2, K^*, J_{sel}^{k-1}, J_{for})$;
10      if $j^k = -1$, return $J_{sel}^{k-1}$ and $\beta^{*(k-1)}$ and stop;
11      $X_{aug}^k = G(X, J_{sel}^{k-1} \cup j^k, L)$;
12      for $i \in [1, (m - L)]$ do
13          $X_{aug}^k(i,:) = \frac{X_{aug}^k(i,:) - M(X_{aug}^k)^T}{S(X_{aug}^k)^T}$;
14      $\beta^{*k} = X_{aug}^k \backslash (y - M(y))$;
15      $r^k = y - M(y) - X_{aug}^k \beta^{*k}$;
16      $J_{sel}^k = J_{sel}^{k-1} \cup j^k$;
17      if $\| r^k \|_2 \leq \epsilon$, break;
18  return $J_{sel}^k$, $\beta^{*k}$.


² Although each column of the lagged matrix has a different mean and standard deviation, due to the structure of these columns, it is possible to compute the mean and the standard deviation of the time series itself and use those to center and scale the lagged columns.


We begin by describing Algorithm 1: GOMP, which will be used to establish causality of time-series data. This algorithm receives the variables $X, y, G, M, S, L, \epsilon, K^*$ (described in Table 1) as input. Briefly, $y \in \mathbb{R}^{(m-L) \times 1}$ is a target vector for which we want to establish the Granger causality (note that we have excluded the first L values of y). In contrast, $X \in \mathbb{R}^{m \times n}$ is the input unlagged time series data. L is the number of lags for each predictor in each target series, $K^*$ is the maximum number of predictors to be selected per target, and $\epsilon$ determines whether a new predictor needs to be added. In addition, G, M and S are grouping, centering, and scaling functions which have been described in Section 2.2. $J_{sel}^0$ is the set of pre-selected predictor indices for y, and always contains the lagged y. $J_{for}$ is the set of forbidden predictors, if any, for y. If there are no forbidden predictors, then $J_{for} = \emptyset$. Given these, the goal is to greedily find predictors that solve the system $X \beta = y$ subject to sparsity constraints.


The greedy algorithm approximates an $\ell_0$-sparse solution by iteratively choosing the best predictor for addition at each iteration. We use superscripts to denote the iteration number in Algorithm 1. For example, $J_{sel}^0$ represents the initial values of $J_{sel}$ at the 0th iteration (before the actual iteration starts). The first part of the algorithm (lines 1 to 4) constructs and solves a linear system consisting of the predictors in $J_{sel}^0$ to obtain $\beta^{*0}$, the coefficient vector for predictors on the transformed scale. At the end of this first part, we have $r^0$, the initial residual. The algorithm then checks whether there are redundant predictor series in $J_{sel}^0$ and, if so, deletes them (line 5). If the number of predictor series in the (updated) $J_{sel}^0$ is equal to or larger than the maximum number of iterations (i.e., $|J_{sel}^0| \geq K^*$), then it keeps the first $K^*$ predictor series in $J_{sel}^0$, updates $\beta^{*0}$, returns $J_{sel}^0$ and $\beta^{*0}$, and stops (line 6); otherwise (i.e., $|J_{sel}^0| < K^*$), it updates $\beta^{*0}$ and $r^0$ (line 7) if any redundant predictor series were deleted. The iterative process then adds one predictor series at a time (line 8). The first step in predictor selection (line 9) consists of an argmin function that systematically goes over each eligible predictor and evaluates its goodness (see Algorithm 2). This step is the performance-critical portion of the algorithm, and the candidates can be searched in parallel. At the end of the step, $j^k$, the index corresponding to the best predictor, is available. However, if no suitable predictor is found in the argmin function (i.e., $j^k = -1$), then the algorithm returns $J_{sel}^{k-1}$ and $\beta^{*(k-1)}$ and stops (line 10). The next part (lines 11 to 14) re-estimates the model coefficients by adding $j^k$ to the model. Line 15 updates the residual, $r^k$, for this model and line 16 adds $j^k$ to the selected set $J_{sel}^k$. Finally, if the $\ell_2$ norm of the current residuals is equal to or smaller than the tolerance value (i.e., $\| r^k \|_2 \leq \epsilon$), then the iterative process is terminated (line 17).


Note that if the tolerance $\epsilon$ is achieved by adding $j^k$, then no new iterations are required and the iterative process is terminated. Thus the actual number of predictors selected, K, can be less than the maximum number of iterations (i.e., $K < K^*$). However, if the tolerance $\epsilon$ is set very small, then it is highly unlikely that such a situation will happen.


Algorithm 2: argmin

Input: $X, r, G, M, S, L, \epsilon_2, K^*, J_{sel}, J_{for}$.
Output: $j_{sel}$: Selected group index.
1   cost = $\| r \|_2^2$, $j_{sel} = -1$;
2   for $j \in 1, 2, 3, ..., n$ do
3       if ($j \in J_{sel}$ || $j \in J_{for}$) continue;
4       $X_{G_j} = G(X, j, L)$;
5       for $i \in [1, (m - L)]$ do $X_{G_j}(i,:) = \frac{X_{G_j}(i,:) - M(X_{G_j})^T}{S(X_{G_j})^T}$;
6       $\beta_j = X_{G_j} \backslash r$;
7       $r_j = r - (X_{G_j} \beta_j)$;
8       if $\| r_j \|_2^2 <$ (cost $- \epsilon_2$), then (cost, $j_{sel}$) = ($\| r_j \|_2^2$, $j$);
9   return $j_{sel}$.


The implementation of the argmin function (line 9, Algorithm 1) is shown in Algorithm 2. The algorithm first assigns the initial cost to be the square of the $\ell_2$ norm of the current residuals, and the selected group index to be -1 (line 1). Then it loops over each series group, first checking if the time series being considered for addition (j) has already been added to the solution $J_{sel}$ or if it is a forbidden predictor (line 3). If the current group (j) is not yet selected, the lagged transformed matrix corresponding to this time series ($X_{G_j}$) is constructed using the G, M and S functions (lines 4 and 5). After grouping and transforming $X_{G_j}$, the residual ($r_j$) corresponding to the candidate time series j is computed by first regressing r on $X_{G_j}$ (line 6), and then computing the residual (line 7). Finally, the current time series is selected as the leading candidate if the square of the $\ell_2$ norm of its residual ($r_j$) is lower than the previous estimate minus a threshold value, $\epsilon_2$. Including such a threshold value prevents selecting an (almost) identical series.


The loop in Algorithm 2 (line 2) can be thought of as iterating over all candidate series. For each candidate series, the following computations are carried out: (1) a filter is applied in line 3 to ensure that it is a valid candidate; (2) lines 4 and 5 map the current candidate to the transformed matrix ($X_{G_j}$) that represents the lag matrix to be used; (3) lines 6 and 7 evaluate the goodness of the current candidate by first solving a dense linear system and then computing the residual; (4) line 8 applies a predicate to check if the current candidate series is better than previously evaluated candidates. Notice that the predicate (line 8) is associative and commutative; therefore, Algorithm 2 can be parallelized by dividing the iteration space ([1, n]) into chunks and executing each chunk in parallel. To get the globally best group, it is sufficient to reduce the groups that were selected by each parallel instance in a tree-like fashion by applying the predicate in line 8.
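Algorithms 1 and 2 can be sketched together in NumPy. This is a simplified illustration under stated assumptions, not the product implementation: the redundancy check of line 5 is omitted, `numpy.linalg.lstsq` stands in for the regression solve operator, the sample standard deviation (ddof=1) is used for scaling, column indices are 0-based, and the default value of `eps2` is hypothetical. All function names are illustrative.

```python
import numpy as np


def lag_block(X, j, L):
    """Standardized (m-L) x L lag matrix for column j (lag-1 first)."""
    m = X.shape[0]
    B = np.column_stack([X[L - lag:m - lag, j] for lag in range(1, L + 1)])
    return (B - B.mean(axis=0)) / B.std(axis=0, ddof=1)


def argmin_group(X, r, L, eps2, selected, forbidden):
    """Algorithm 2: index of the group that most reduces the squared residual norm, or -1."""
    cost, best = float(r @ r), -1
    for j in range(X.shape[1]):
        if j in selected or j in forbidden:
            continue
        Xg = lag_block(X, j, L)
        beta_j, *_ = np.linalg.lstsq(Xg, r, rcond=None)
        r_j = r - Xg @ beta_j
        if r_j @ r_j < cost - eps2:
            cost, best = float(r_j @ r_j), j
    return best


def gomp(X, y, L, eps, k_max, preselected, forbidden=(), eps2=1e-10):
    """Algorithm 1 without the redundancy check; y holds the target from time L+1 on."""
    X = np.asarray(X, dtype=float)
    yc = np.asarray(y, dtype=float) - np.mean(y)  # the target is only centered
    selected = list(preselected)[:k_max]

    def fit(groups):
        X_aug = np.hstack([lag_block(X, j, L) for j in groups])
        beta, *_ = np.linalg.lstsq(X_aug, yc, rcond=None)
        return beta, yc - X_aug @ beta

    beta, r = fit(selected)
    for _ in range(k_max - len(selected)):
        j = argmin_group(X, r, L, eps2, selected, forbidden)
        if j == -1:
            break
        selected.append(j)
        beta, r = fit(selected)
        if np.linalg.norm(r) <= eps:
            break
    return selected, beta
```

Because the selection predicate only compares costs, the loop inside `argmin_group` could be split into chunks and the per-chunk winners reduced afterwards, as described above.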


2.4 Selecting L 


Both Algorithms 1 and 2 accept L as an input parameter, which can be specified by the user. If L is not explicitly specified, then the following heuristic approach can be used to determine L based on m (number of time points) and s (periodicity or seasonal length):

(1) If s > 1 and m ≥ 4s, then L = min(s, 20).

(2) If s = 1 or m < 4s, then L = 5.
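The heuristic translates directly into a small function; the name `select_lag` is hypothetical.

```python
def select_lag(m, s):
    """Default number of lags from m time points and seasonal length s."""
    if s > 1 and m >= 4 * s:
        return min(s, 20)
    return 5  # covers s = 1 or m < 4s
```

For example, monthly data with s = 12 and at least 48 time points gets L = 12, while a seasonal length above 20 is capped at 20.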
2.5 AR(L) Model


Out of the n series in the data, some series may be used as predictors only, so no TCM models are built for them. 
However, if they are selected as predictors for some target series, then simple models need to be built for them in 
order to do forecasting. For example, suppose that time series 1 is a selected predictor for time series 2, but there is 
no model built for time series 1. While a model for time series 1 is not needed in order to forecast time series 2 at 
time (t + 1) (where t is the latest time in the data), forecasts for time (t + 2) require values of time series 1 for time 
(t + 1), which then requires a model for time series 1. 


Hence, for each predictor-only series, a simple auto-regressive (AR) model is built using the same lag, L, as used for the target series. This model, called an AR(L) model, can be constructed using Algorithm 1 by specifying $J_{sel}^0$ to be the target itself and setting the maximum number of predictors to be 1.


2.6 Post-estimation steps 


Algorithm 1 selects the best predictors (time series) to model a target series y. Without loss of generality, we assume that the model for y is $y = \bar{y} + X_G^* \hat{\beta}^* + r = \hat{y} + r$, where $X_G^*$ is the selected predictor series matrix with the lagged terms on the transformed scale, $\hat{\beta}^*$ is the estimated standardized coefficient vector, and $r = y - \hat{y}$ is the residual vector.


However, this is not the end of modeling. Several post processing steps are needed in order to complete the 
modeling process for y. The steps include three parts: (1) coefficients and statistics inference; (2) tests of model 
effects; (3) model quality measures. 


2.6.1 Coefficients and statistical inference 


The results of Algorithm 1 include $\hat{\beta}^*$ and $(X_G^{*T} X_G^*)^{-}$ (by solving the linear system from Cholesky decomposition), where superscript T means the transpose of a matrix or vector, and $(Z)^{-}$ is a generalized inverse of the matrix Z. Based on these quantities, the first step is to compute coefficient estimates, their standard errors, and statistical inference on the original scale.


Table 2: Additional notation

Notation    Description

$K$    Actual number of predictors selected (including target itself) for y, i.e., $K = |J_{sel}|$.

$p$    Number of coefficient estimates in $\hat{\beta}^*$, i.e., $p = K \times L$.

$p^*$    Number of non-redundant coefficient estimates in $\hat{\beta}^*$. $p^* \leq p$.

$X_G^*$    Selected predictor series matrix with lagged terms on the transformed scale. This is an (m - L) x p matrix, as $X_G^* = [X_{G_1}^*, ..., X_{G_K}^*]$ with $X_{G_j}^* = G(X^*, j, L) = [X_{G_{j,1}}^*, ..., X_{G_{j,L}}^*]$ (an (m - L) x L matrix).

$X_G$    Selected predictor series matrix on the original scale. This is an (m - L) x (p + 1) matrix, as $X_G = [1, X_{G_1}, ..., X_{G_K}] = [1, X_{G_{1,1}}, ..., X_{G_{1,L}}, ..., X_{G_{K,1}}, ..., X_{G_{K,L}}]$, where 1 is a column vector of 1's corresponding to an intercept.

$\hat{\beta}$    Unstandardized coefficient estimates vector (corresponding to $X_G$), which is a (p + 1) x 1 vector. The first element, $\hat{\beta}_0$, is the intercept estimate.

$\hat{\sigma}^2$    Estimated variance of the model based on residuals.

$\Sigma^*$    Covariance matrix of standardized coefficient estimates on the transformed scale, i.e., $\Sigma^* = \hat{\sigma}^2 (X_G^{*T} X_G^*)^{-}$. The j-th diagonal element is $\hat{\sigma}_{\hat{\beta}_j^*}^2$ and its square root, $\hat{\sigma}_{\hat{\beta}_j^*}$, is the standard error of the j-th standardized coefficient estimate.

$\Sigma$    Covariance matrix of unstandardized coefficient estimates on the original scale. The j-th diagonal element is $\hat{\sigma}_{\hat{\beta}_j}^2$ and its square root, $\hat{\sigma}_{\hat{\beta}_j}$, is the standard error of the j-th unstandardized coefficient estimate.

$M$    Centering vector of X, i.e., $M = [M_1, ..., M_p]^T$, where $M_j$ is the mean of the j-th lagged column of $X_G$.

$S$    Scaling matrix of X, i.e., $S = diag[S_1, ..., S_p]$, where $S_j$ is the standard deviation of the j-th lagged column of $X_G$.

$A$    Transformation matrix of X to $X^*$, i.e., $A = \begin{bmatrix} -M^T S^{-1} \\ S^{-1} \end{bmatrix}$, which is a (p + 1) x p matrix. Note that $X_G^* = X_G A$.


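The relationship between $\hat{\beta}$ and $\hat{\beta}^*$ is $\hat{\beta} = A \hat{\beta}^* + [\bar{y}, 0, ..., 0]^T$ and the relationship between $\Sigma$ and $\Sigma^*$ is $\Sigma = A \Sigma^* A^T$. The relevant statistics are computed as follows:

• Unstandardized coefficient estimates

$$\hat{\beta}_j = S_j^{-1} \hat{\beta}_j^*, \quad j = 1, ..., p \qquad (3)$$

$$\hat{\beta}_0 = \bar{y} - M^T S^{-1} \hat{\beta}^* \qquad (4)$$

• Standard errors of unstandardized coefficient estimates

$$\hat{\sigma}_{\hat{\beta}_j} = \hat{\sigma}_{\hat{\beta}_j^*} / S_j, \quad j = 1, ..., p \qquad (5)$$

$$\hat{\sigma}_{\hat{\beta}_0} = \sqrt{ \left[ \frac{M_1}{S_1}, ..., \frac{M_p}{S_p} \right] \Sigma^* \left[ \frac{M_1}{S_1}, ..., \frac{M_p}{S_p} \right]^T } \qquad (6)$$

where $\Sigma^* = \hat{\sigma}^2 (X_G^{*T} X_G^*)^{-}$ and $\hat{\sigma}^2 = SS_e / df_e$, with $SS_e = \| y - \hat{y} \|_2^2 = \sum_{t=L+1}^{m} (y_t - \hat{y}_t)^2$ and $df_e = m - L - p^* - 1$.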


• t-statistics for coefficient estimates

$$t_j = \hat{\beta}_j / \hat{\sigma}_{\hat{\beta}_j}, \quad j = 0, 1, ..., p \qquad (7)$$

which follows an asymptotic t distribution with $df_e$ degrees of freedom. Then the p-value is computed as

$$p_{t_j} = 2 \times \left( 1 - prob(t_{df_e} \leq |t_j|) \right) \qquad (8)$$

• 100(1 - α)% confidence intervals

$$\hat{\beta}_j \pm \hat{\sigma}_{\hat{\beta}_j} \times t_{\alpha/2, df_e} \qquad (9)$$

where α is the significance level and $t_{\alpha/2, df_e}$ is the 100(1 - α/2) percentile of the t distribution with $df_e$ degrees of freedom.
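The back-transformation of Equations (3) to (6) can be written compactly with the matrix A from Table 2. The sketch below is illustrative only; the function name and argument layout are assumptions, and `cov_star` is expected to be the already computed covariance matrix on the transformed scale.

```python
import numpy as np


def unstandardize(beta_star, cov_star, means, stds, y_mean):
    """Map standardized estimates to the original scale.

    Returns (beta, se): beta has p + 1 entries (intercept first) and se
    holds the square roots of the diagonal of A cov_star A^T.
    """
    beta_star = np.asarray(beta_star, dtype=float)
    means = np.asarray(means, dtype=float)
    p = len(beta_star)
    S_inv = np.diag(1.0 / np.asarray(stds, dtype=float))
    A = np.vstack([-(means @ S_inv)[None, :], S_inv])  # (p + 1) x p
    shift = np.zeros(p + 1)
    shift[0] = y_mean
    beta = A @ beta_star + shift
    cov = A @ np.asarray(cov_star, dtype=float) @ A.T
    return beta, np.sqrt(np.diag(cov))
```

The first row of A reproduces Equations (4) and (6), and the remaining rows reproduce Equations (3) and (5).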


2.6.2 Tests of model effects 


For each selected predictor series for y, there are L lagged columns associated with it. The columns can be grouped together, considered as an effect, and tested with a null hypothesis of zero for all coefficients. This is similar to the test of a categorical effect with all dummy variables in a (generalized) linear model setting. Only type III tests are conducted here. For each selected predictor series $X_{G_j}$, the type III test matrix $L_j$ is constructed and $H_0: L_j \beta = 0$ is tested based on an F-statistic.

• F-statistics for effects

$$F_j = \frac{ (L_j \hat{\beta})^T (L_j \Sigma L_j^T)^{-} (L_j \hat{\beta}) }{ r_j } \qquad (10)$$

where $r_j = rank(L_j \Sigma L_j^T)$. The statistic follows an approximate F distribution with the numerator degrees of freedom $r_j$ and the denominator degrees of freedom $df_e$. Then the p-value is computed as follows:

$$p_{F_j} = 1 - prob(F_{r_j, df_e} \leq F_j) \qquad (11)$$


2.6.3 Model quality measures 


In addition to statistical inferences, the goodness of the model can be evaluated. The following model quality 
measures are provided: 
• Root Mean Squared Error (RMSE)

$$RMSE = \sqrt{MSE} = \sqrt{SS_e / df_e} \qquad (12)$$

Note that $RMSE = \hat{\sigma}$.


• Root Mean Squared Percentage Error (RMSPE)

$$RMSPE = \sqrt{MSPE} \qquad (13)$$


• R squared

$$R^2 = 1 - \frac{\sum_{t=L+1}^{m} (y_t - \hat{y}_t)^2}{\sum_{t=L+1}^{m} (y_t - \bar{y})^2} = 1 - \frac{SS_e}{SS_t} \qquad (14)$$

• Bayesian Information Criterion (BIC)

$$BIC = (m - L) \ln \left( \frac{SS_e}{m - L} \right) + (p^* + 1) \ln(m - L) \qquad (15)$$

• Akaike Information Criterion (AIC)

$$AIC = (m - L) \ln \left( \frac{SS_e}{m - L} \right) + 2 (p^* + 1) \qquad (15')$$
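A short sketch of these measures follows. It assumes that the quantity inside the logarithm of BIC and AIC is $SS_e / (m - L)$ and that RMSE uses $df_e = m - L - p^* - 1$; RMSPE is left out because its defining expression is not given above. The function name is hypothetical.

```python
import math


def quality_measures(y, y_hat, p_star):
    """RMSE, R squared, BIC and AIC for targets y (times L+1..m) and fits y_hat."""
    n = len(y)  # n = m - L
    ss_e = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    y_bar = sum(y) / n
    ss_t = sum((a - y_bar) ** 2 for a in y)
    df_e = n - p_star - 1
    return {
        "RMSE": math.sqrt(ss_e / df_e),
        "R2": 1.0 - ss_e / ss_t,
        "BIC": n * math.log(ss_e / n) + (p_star + 1) * math.log(n),
        "AIC": n * math.log(ss_e / n) + 2 * (p_star + 1),
    }
```

BIC and AIC share the same goodness-of-fit term and differ only in the penalty, so BIC penalizes additional coefficients more heavily once m - L exceeds about 7.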


3. Scoring 


Once the models ($\hat{\beta}$, $J_{sel}$) for all the required targets (y) are built and post-estimation statistics are computed, the
next task is to use these models to do scoring. There are two types of scoring: (1) fit: in-sample prediction for the 
past and current values of the target series; (2) forecast: out-of-sample prediction for future values of the target 
series. 


3.1 Fit 


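Without loss of generality, we assume $X$ and $X_G$ are the selected predictor series matrices without lagged terms and with lagged terms, respectively; and $\hat{\beta}$ is the coefficient estimates vector for the target y, so $X = [X_1, ..., X_K]$, $X_G = [1, X_{G_{1,1}}, ..., X_{G_{1,L}}, ..., X_{G_{K,1}}, ..., X_{G_{K,L}}]$ and $\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_{1,1}, ..., \hat{\beta}_{1,L}, ..., \hat{\beta}_{K,1}, ..., \hat{\beta}_{K,L})^T$. Given that all series have m time points, in-sample prediction of y is one-step ahead prediction and can be written as

$$\hat{y}_t = X_{G,t} \hat{\beta} = \hat{\beta}_0 + \sum_{j=1}^{K} \sum_{\ell=1}^{L} \hat{\beta}_{j,\ell} \times X_{G_{j,\ell},t} \qquad (16)$$

$$= \hat{\beta}_0 + \sum_{j \in J_{sel}} \sum_{\ell=1}^{L} \hat{\beta}_{j,\ell} \times X_{j,t-\ell}, \quad t = L + 1, ..., m. \qquad (17)$$

The corresponding 100(1 - α)% confidence interval of $\hat{y}_t$ is

$$\left[ \hat{y}_t - t_{\alpha/2, df_e} \times \hat{\sigma}, \; \hat{y}_t + t_{\alpha/2, df_e} \times \hat{\sigma} \right], \quad t = L + 1, ..., m. \qquad (18)$$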


3.2 Forecast 
Given that data is available up to time interval m, the one-step ahead forecast for y is

$$\hat{y}_m(1) = \hat{\beta}_0 + \sum_{j \in J_{sel}} \sum_{\ell=1}^{L} \hat{\beta}_{j,\ell} \times X_{j,m+1-\ell} \qquad (19)$$

The h-step ahead forecast for y is

$$\hat{y}_m(h) = \hat{\beta}_0 + \sum_{j \in J_{sel}} \sum_{\ell=1}^{L} \hat{\beta}_{j,\ell} \times \tilde{X}_{j,m+h-\ell} \qquad (20)$$

where

$$\tilde{X}_{j,m+h-\ell} = \begin{cases} X_{j,m+h-\ell}, & h \leq \ell \\ \hat{X}_{j,m}(h - \ell), & h > \ell \end{cases}$$

Thus, forecasting the value of $y_{m+2}$ requires us to first forecast the values of all the predictors up to time (m + 1). Forecasting the values of all the predictors up to time (m + 1) requires us to use Equation (19) on all the predictors $j \in J_{sel}$. Similarly, to predict the value of $y_{m+3}$, we need to forecast the values of predictors $j \in J_{sel}$ at time (m + 2) by using Equation (20). This task poses a bigger problem: to forecast the values of $j \in J_{sel}$ at time (m + 2), we first need to forecast the values of the predictors of $j \in J_{sel}$ at time (m + 1). That is, as we look further into the future, we need to forecast more and more values to determine the value of $y_{m+h}$.
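The recursion in Equations (19) and (20) can be sketched as a loop that forecasts every modeled series one step at a time and feeds the forecasts back in as predictor values. The data layout below (dictionaries keyed by series name) is hypothetical, and every predictor is assumed to have its own model, for example an AR(L) model as in Section 2.5.

```python
def forecast(history, models, horizon):
    """Recursive h-step forecasts.

    history: dict name -> list of observed values (all of length m).
    models:  dict name -> (intercept, {predictor: [coef_lag1, ..., coef_lagL]}).
    Returns dict name -> list of forecasts for steps 1..horizon.
    """
    series = {name: list(values) for name, values in history.items()}
    m = len(next(iter(series.values())))
    for h in range(1, horizon + 1):
        step = {}
        for name, (intercept, coefs) in models.items():
            y = intercept
            for pred, lag_coefs in coefs.items():
                for lag, b in enumerate(lag_coefs, start=1):
                    # Time m + h - lag: observed if h <= lag, else an earlier forecast.
                    y += b * series[pred][m + h - lag - 1]
            step[name] = y
        for name, y in step.items():
            series[name].append(y)
    return {name: series[name][m:] for name in models}
```

All series are advanced together at each step, so the forecast of a target at step h can always find the step h - 1 forecasts of its predictors.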


3.3 Approximated forecasting variances and intervals 


In this subsection, we outline how forecasting variances and intervals can be computed for TCM models. We start by using the following representation for the linear model built by TCM for target $y_{m+h}$:

$$y_{m+h} = \hat{\beta}_0 + \sum_{j \in J_{sel}} \sum_{\ell=1}^{L} \hat{\beta}_{j,\ell} \times X_{j,m+h-\ell} + \epsilon_{m+h} \qquad (21)$$

where $\epsilon_{m+h} \sim N(0, \sigma^2)$ and $\sigma^2$ is estimated as $\hat{\sigma}^2$ (computed in Section 2.6.1). Note that parameter estimation error is not included when defining forecasting error in TCM.


The forecasting error at m + 1 is defined as the difference between $y_{m+1}$ and $\hat{y}_m(1)$, which can be written as

$$e_{y,m}(1) = y_{m+1} - \hat{y}_m(1) = \epsilon_{m+1} \qquad (22)$$

The forecasting variance for one-step ahead forecasts is computed as $\hat{\sigma}^2$. For multi-step ahead forecasts, the forecasting error at m + h is

$$e_{y,m}(h) = y_{m+h} - \hat{y}_m(h) = \sum_{j \in J_{sel}} \sum_{\ell=1}^{L} \hat{\beta}_{j,\ell} \times e_{X_j,m}(h - \ell) + \epsilon_{m+h} \qquad (23)$$

where $e_{X_j,m}(h - \ell) = X_{j,m+h-\ell} - \hat{X}_{j,m}(h - \ell)$ and $e_{X_j,m}(h - \ell) = 0$ if $h \leq \ell$.


In general, the error terms $e_{X_j,m}(h - \ell)$, $\ell = 1, ..., L$, are not independent of each other. The larger h is, the more complex the dependence is. In addition, $e_{X_j,m}(h - \ell)$ and $e_{X_i,m}(h - \ell)$ might not be independent for $j, i \in J_{sel}$. In order to fully consider the dependence, we need to write all time series in vector autoregressive (VAR) format. Since we assume the number of series n is usually large, the parameter matrix, which is an n x n matrix, might be too large to handle in computation of the forecasting variances. Therefore, we make the assumption that all forecasting error terms in Equation (23), $e_{X_j,m}(h - \ell)$, $j \in J_{sel}$, $\ell = 1, ..., L$, are independent, so it is easier to compute the forecasting variances.
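Based on the above independence assumption, the approximated variance of the forecasting error, $e_{y,m}(h)$, is

$$\hat{\sigma}_{e_{y,m}(h)}^2 = \sum_{j \in J_{sel}} \sum_{\ell=1}^{L} \hat{\beta}_{j,\ell}^2 \, \hat{\sigma}_{e_{X_j,m}(h-\ell)}^2 + \hat{\sigma}^2 \qquad (24)$$

where $\hat{\sigma}_{e_{X_j,m}(h-\ell)}^2$ is the variance of the forecasting error in the series $X_j$ at $m + h - \ell$.

Then the corresponding 100(1 - α)% approximated forecasting interval of $y_{m+h}$ can be expressed as

$$\left[ \hat{y}_m(h) - t_{\alpha/2, df_e} \times \hat{\sigma}_{e_{y,m}(h)}, \; \hat{y}_m(h) + t_{\alpha/2, df_e} \times \hat{\sigma}_{e_{y,m}(h)} \right] \qquad (25)$$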
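Under the independence assumption, Equation (24) becomes a simple recursion over the forecast horizon. The sketch below uses a hypothetical dictionary layout and assumes that every predictor has its own model and residual variance estimate.

```python
def forecast_variances(models, sigma2, horizon):
    """Approximate forecasting error variances per Equation (24).

    models: dict name -> {predictor: [coef_lag1, ..., coef_lagL]}.
    sigma2: dict name -> estimated residual variance of that series' model.
    Returns dict name -> [variance at step 1, ..., variance at step horizon].
    """
    var = {name: [] for name in models}
    for h in range(1, horizon + 1):
        step = {}
        for name, coefs in models.items():
            v = sigma2[name]
            for pred, lag_coefs in coefs.items():
                for lag, b in enumerate(lag_coefs, start=1):
                    if h > lag:  # the error is zero when the value is observed
                        v += b * b * var[pred][h - lag - 1]
            step[name] = v
        for name, v in step.items():
            var[name].append(v)
    return var
```

The square root of each entry is the standard error used in the interval of Equation (25). At h = 1 every predictor value is observed, so the variance reduces to the model's residual variance.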


4. Scenario analysis 


Scenario analysis refers to a capability of TCM to “play-out” the repercussions of artificially setting the value of a 
time series. A scenario is the set of forecasts that are generated by substituting the values of a root time series by a 
vector of substitute values, as illustrated in Figure 1. 


Figure 1: Causal graph of a root time series and the specification of the vector of substitute values 
During scenario analysis, we specify the targets that we want to analyze as a response to changes in the values of the root series (“a” in Figure 1), along with the time window. In Figure 1, we are interested in the behavior of time series “c”, “d”, “g”, “h”, and “j” only. The rest of the time series are ignored. The figure also depicts the vector $a_W$ of values for “a” that should be used instead of the observed or predicted values of “a”. The values $(t_b, t_e, T, T_l)$ specify the beginning and end of the replacement values for the root series, the current time, and the farthest time for analysis, respectively.


The partial Granger causal graph of time series “a” is shown in Figure 1. That is, “a” is the parent of itself, “b”, “c”, and “d”. Similarly, it is the grandparent of “e”, “f”, “g”, “h”, “i”, and “j”. Further descendants are possible, but two generations suffice for the sake of explanation. Figure 1 also displays the specification of the vector $a_W$, of length W, that contains the replacement values of the root series. In the example shown in the figure, $a_W$ starts at time $t_b < T$, where T is the current time, and ends at $t_e > T$, which is in the future. We are also given $T_l$, the last time point ($t_b < T_l$) for which we want to perform scenario analysis on the target variables. Finally, we are given a set of time series for which the scenario predictions are carried out. In the figure, these are “c”, “d”, “g”, “h”, and “j”, which are marked with a thick red border. Since “b” is required to model “g”, “b” is marked with a thick blue border to signify that it is an induced target. Given this information, the goal of scenario analysis is to forecast the values of the target time series (“c”, “d”, “g”, “h”, and “j”) up to time $T_l$, based on the values of the root time series $a_W$. Notice that we have to predict values of targets up to time $T_l$, where $T_l$ can be > (T + 1) or ≤ (T + 1). When $T_l$ = (T + 2), we need to compute the values of the predictors of the target time series at time (T + 1). Similarly, when $T_l$ = (T + 3), we need to compute the values of the predictors' predictors at time (T + 1) and the values of the predictors at time (T + 2) before predicting the values of the target time series at time (T + 3).


Figure 2: Scenarios with and without predicting future values 
The left-hand panel in Figure 2 depicts a scenario where the values of ancestors of targets of interest also have to be predicted. In this particular case, $T_l$ = (T + 3) and therefore it is necessary to predict the values of the predictors of the targets at (T + 1) and (T + 2), and values of the predictors' predictors at time (T + 1). The right-hand panel depicts a scenario where the entire period of prediction is earlier than the current time T (i.e., $T_l < T$). In this case, all the values of the predictors and their ancestors are readily available.


Determining $a_W$

In the discussion above, we have neglected the issue of $a_W$, the substitute values for time series “a”, which is the root time series. For purposes of scenario analysis, it is sufficient to consider that $a_W$ is readily available. In a typical use case for scenario analysis, $a_W$ will come from the values specified by the user's direct input, although its values could also come as input from a calling meta-process (as is the case with the use of scenario analysis as a subprocedure in root cause analysis, as shown in Section 6).

Caveat on scenario analysis 

It is possible to carry out scenario analysis for a time period that is entirely in the future; that is, $t_b > T$. However, forecasting errors in the remaining predictors may make such scenario analysis inherently low-precision. That is, if $\delta = t_b - T$ and $t_b > T$, then the precision of scenario analysis decreases with an increase in $\delta$.


4.1 SA, the scenario analysis algorithm 
Input: 


The inputs to SA are: (1) r: the root time series; (2) $r_W$: the vector of replacement values for time series r; (3) $(t_b, t_e, T, T_l)$: the beginning and end time for the modified values of r, the current time, and the last time point for which target values need to be predicted, respectively; (4) D: a set of descendant target time series of interest along with their relation to r (which may be input as the Granger causal graph, G). Notice that the length of $r_W$ is $t_e - t_b + 1$ and $t_b < T_l$. Furthermore, it is erroneous to have a target $d \in D$ where r is not an ancestor of d.


Output: 


For each d in D, we output a vector d_s, containing values that pertain to the scenario analysis of these time
series and the corresponding confidence intervals (when T_l < T) or approximated forecasting intervals (when
T_l > T). Please note that the time period for the children series in D is [t_b + 1, T_l], for the grand-children
series is [t_b + 2, T_l], etc.


Preparation: 


To prepare for SA, we first calculate the closure on the set of targets D* that need to be predicted, which is
determined by the relationship between r and each of the targets in D. Essentially, D* is computed by iteratively
looking at the path from each d ∈ D and adding all those intermediate nodes that are ancestors of d and are also
descendants of r. In the example shown in Figure 1, the time series “b” is itself not of primary interest, but since it
is a parent of “g”, which is of interest, “b” is also added as a target of interest to the set {“c”, “d”, “g”, “h”, “j”}.


Next, we compute M, the set of models that need to be included in order to perform scenario analysis on D*.
M contains the models for each of the series in D*, i.e., D* ⊆ M; however, depending on the time span
of the scenario analysis, additional models of some time series might have to be brought in (see Figure 2).
Depending on how far ahead T_l is from T, we may need to compute the values of the ancestors (other than
r) of the targets of interest at time points (T + 1), ..., (T_l − 1). That is, the set {M − D*} (which may be ∅) contains
all series that are needed for scenario analysis and are not descendants of r.


At the end of the preparation phase we have D* and M, which allows us to predict all the time series of interest. 
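
The closure step can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the graph is given as a parent-to-children dictionary, the function names `descendants` and `target_closure` are hypothetical, and the example graph is invented for illustration (it is not the graph of Figure 1).

```python
from collections import deque

def descendants(graph, node):
    """Return all nodes reachable from `node` by following parent -> child edges."""
    seen, queue = set(), deque([node])
    while queue:
        current = queue.popleft()
        for child in graph.get(current, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

def target_closure(graph, root, targets):
    """Compute D*: the targets plus every node lying on a path from root to a target."""
    targets = set(targets)
    below_root = descendants(graph, root)
    for d in targets:
        if d not in below_root:
            raise ValueError(f"root {root!r} is not an ancestor of target {d!r}")
    closure = set(targets)
    for node in below_root:
        # An intermediate node qualifies if some target is among its descendants.
        if descendants(graph, node) & targets:
            closure.add(node)
    return closure

# Hypothetical causal graph (parent -> children); "e" is not a descendant of "a".
graph = {"a": ["b", "c", "d"], "b": ["g"], "c": ["h"], "d": ["j"], "e": ["g"]}
d_star = target_closure(graph, "a", {"c", "d", "g", "h", "j"})
```

In this invented graph, “b” is added because it lies between the root and “g”, while “e” is left out of D* because it is not a descendant of the root; a series like “e” would belong to M − D* only if its values had to be forecast.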


Computation: 


The computation in scenario analysis is exactly that of scoring the values of a set of time series (see Section 3). For
each target in D*, we have a range of time points for which we need to fit/forecast values. For example, for
immediate children of the root (“c”, “d”, and the induced child “b” in Figure 1), this range is [t_b + 1, T_l]. Similarly,
for grand-children (“g”, “h”, and “j” in Figure 1), this range is [t_b + 2, T_l]. Using the models in M and the substituted
values r_s for r, this task can be carried out.


5. Outlier detection 


One of the advantages of building TCM models is the ability to detect model-based outliers. Outliers can be defined
in several ways. For now, we shall define an outlier in a time series to be a value that strays too far from its expected
(fitted) value based on the TCM models. The detection process is based on the normal distribution assumption for
series y. Consider the value of a time series y at time t. Let y_t and ŷ_t be the observed and expected values of y at
time t, respectively, and let σ̂² be the variance of y from the TCM model (based on residuals). Given these inputs, we
call y_t an outlier if the likelihood of y_t, when modeled as a normal random variable with mean ŷ_t and variance σ̂², is
below a particular threshold.

Input:


The inputs to OD (outlier detection) are: (1) y_t, ∀ t; (2) ŷ_t, ∀ t; (3) σ̂²; (4) the outlier threshold value κ ∈ (0,1]
(the default is 0.95).


Computation: 


a) Under the assumption that the observed value y_t is a normal random variable with mean ŷ_t and variance
σ̂², compute the square score at time t as

\[ s_{sqr,t} = \frac{(y_t - \hat{y}_t)^2}{\hat{\sigma}^2} \qquad (26) \]


b) Compute the outlier probability as

\[ p_{sqr,t} = \mathrm{prob}\left(\chi^2_1 \le s_{sqr,t}\right) \qquad (27) \]

where χ²_1 is a random variable with a chi-squared distribution with 1 degree of freedom.


c) Flag y_t as an outlier if p_sqr,t ≥ κ.


Output: 


The output to OD for series y is a set of time points with their corresponding outlier probabilities. 
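
Steps a) to c) can be sketched in a few lines of Python. This is an illustrative sketch, not the procedure's actual implementation: the function names are hypothetical, the data are invented, and the chi-squared distribution function with 1 degree of freedom is evaluated through the identity P(χ²_1 ≤ s) = erf(√(s/2)).

```python
import math

def chi2_cdf_1df(s):
    """CDF of a chi-squared variable with 1 degree of freedom at s >= 0."""
    return math.erf(math.sqrt(s / 2.0))

def detect_outliers(observed, expected, variance, kappa=0.95):
    """Return (time index, outlier probability) pairs flagged by steps a) to c)."""
    flagged = []
    for t, (y, y_hat) in enumerate(zip(observed, expected)):
        score = (y - y_hat) ** 2 / variance   # square score, Equation (26)
        prob = chi2_cdf_1df(score)            # outlier probability, Equation (27)
        if prob >= kappa:                     # step c)
            flagged.append((t, prob))
    return flagged

# Invented example: only the third value strays far from its expected value.
observed = [10.1, 9.8, 15.0, 10.3]
expected = [10.0, 10.0, 10.0, 10.0]
outliers = detect_outliers(observed, expected, variance=1.0)
```

With the default threshold of 0.95, a value is flagged when its square score exceeds roughly 3.84, that is, when it lies more than about 1.96 standard deviations from its expected value.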


6. Outlier root cause analysis 


In Section 5, we saw how to detect outliers. The next logical step is to find the likely causes for a time series whose
value has been flagged as an outlier. Outlier root cause analysis refers to the capability to explore the Granger causal
graph in order to analyse the key/root values that resulted in the outlier in question. To formalize this notion,
consider a time series y, whose observed value at time t (that is, y_t) has been flagged as an outlier due to its
abnormal deviation from its expected value ŷ_t. The goal of outlier root cause analysis (ORCA) is to output the set of
time series A that can be considered as root causes of the anomalous value of y_t. The idea is that setting the values
of time series in the predictor set X to their normal/expected values, instead of their observed values, will bring the
outlying y_t back to normal. The normal value of y_t is unknown, so we specify it with the expected value of y at time
t as predicted by y’s univariate model, which is an AR(L) model, and denote it as ỹ_t.


The result of ORCA has the following objective function with a constraint:

\[ \arg\max_{x \in A_y} \left\{ |\hat{y}_t - \tilde{y}_t| - |\hat{y}_{t|x=\hat{x}} - \tilde{y}_t| \right\} \qquad (28) \]

\[ \text{s.t. } |\hat{y}_t - \tilde{y}_t| \ge |\hat{y}_{t|x=\hat{x}} - \tilde{y}_t| \]


where A_y corresponds to the set of ancestors of y according to the Granger causal graph G. The quantity ŷ_{t|x=x̂}
should be interpreted as the likely predicted value of y at time t had the value of its ancestor x been set to its
expected value x̂. We see that Equation (28) is made up of two parts: (1) the portion |ŷ_t − ỹ_t|, which is the
degree of “outlier-ness” of y at t as predicted by the “Granger model”, where the outlier-ness is judged based on
what is expected from the history of y; (2) the portion |ŷ_{t|x=x̂} − ỹ_t|, which is the degree of “outlier-ness” of y at t
as predicted by the “Granger model” if x was corrected. In other words, Equation (28) amounts to replacing the
observed value y_t by its “expected” value, given by a simpler, univariate model. Therefore Equation (28)
expresses the reduction in the degree of outlier-ness in y_t brought about by correcting x.


6.1 ORCA, the outlier root cause analysis algorithm 

Input: 

The inputs to ORCA are: (1) y, the anomalous time series; (2) t, the time at which the anomaly was detected; (3) y_t,
the anomalous value; (4) ŷ_t, the expected value of y_t; (5) k, the oldest generation of ancestors to search based on the
Granger causal graph, G.


Output:


ORCA outputs the set of root causes A of the anomaly in y_t, where each x ∈ A maximizes the objective function in
Equation (28) by the same amount.


Preparation:


To prepare for ORCA, we first compute A_y, the set of ancestors that need to be examined as the potential root
causes of the anomaly in y_t.


Figure 3: Outlier root cause analysis for a time series

In the example shown in Figure 3, assuming that y = “a” and k = 2, then A_y = {“b”, “c”, “d”, “e”, “f”, “g”, “h”,
}. A_y can be computed by performing a reverse breadth-first search from y to k levels.
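
The reverse breadth-first search can be sketched as follows. The sketch assumes the causal graph is stored as a child-to-parents dictionary; the function name `ancestors_within` and the example graph are hypothetical (the graph is not the one in Figure 3). Recording the generation of each ancestor is an added convenience, since the time window of the substitute values depends on it.

```python
from collections import deque

def ancestors_within(parents, y, k):
    """Collect ancestors of y up to k generations back via reverse breadth-first search."""
    generation = {}                  # ancestor -> number of generations above y
    queue = deque([(y, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == k:
            continue                 # do not search beyond the k-th generation
        for parent in parents.get(node, ()):
            if parent != y and parent not in generation:
                generation[parent] = depth + 1
                queue.append((parent, depth + 1))
    return generation

# Hypothetical graph (child -> parents); "m" is a third-generation ancestor of "a".
parents = {"a": ["b", "c"], "b": ["g"], "c": ["g", "h"], "g": ["m"]}
a_y = ancestors_within(parents, "a", k=2)
```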


Second, each potential root cause x ∈ A_y is prepped for scenario analysis by computing the vector of substitute
values of x to be used during scenario analysis. Note that the length of this substitute vector is L, the lag. For
example, consider b_s, the substitute for time series “b” in Figure 3. As “b” is a parent of “a”, we need to compute
the fits of “b” from (t − L) to (t − 1). On the other hand, as “g” is a grand-parent of “a”, g_s contains the fits for
“g” from the time (t − L − 1) to (t − 2) (see Section 3.1 for computation of fits). Please note that this approach
assumes that any anomalies are purely in “b” (the parent series) or “g” (the grandparent series). In particular, it is
assumed that anomalies in “b” are not caused by values in the grandparent series, including anomalous values in the
grandparent series.


Third, for each potential root cause x ∈ A_y, scenario analysis is carried out (see Section 4) using the substitute
values computed in the previous step. For the example in Figure 3, scenario analysis is called for series “b” with the
parameters (r = b, r_s = b_s, t_b = (t − L), t_e = (t − 1), T = t, D = {a}, T_l = t). The result of scenario analysis
is ŷ_{t|x=x̂}.

Computation: 


The process of ORCA is as follows: 


• Initialize A, the set of potential root causes for y_t, to ∅.
• Initialize obj_max, the maximum objective function value, to 0.
• Suppose there are J series in A_y: x_1, ..., x_J.
• For each x_j, j ∈ 1, ..., J, compute obj_j = |ŷ_t − ỹ_t| − |ŷ_{t|x_j=x̂_j} − ỹ_t|.
• If obj_j ≥ obj_max, set obj_max = obj_j and store x_j in A.
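
The loop above can be sketched in Python. This is a hedged illustration: the function name `orca`, its parameters, and the example numbers are hypothetical, and the scenario analysis results are supplied through a callable rather than computed. One assumption is made explicit in the code: when a candidate strictly improves on the current maximum, the set A is restarted, so that every series finally stored attains the same maximal objective value, as the Output description requires.

```python
def orca(candidates, y_hat, y_tilde, corrected_prediction):
    """Return (root causes, best objective) following the ORCA computation steps.

    candidates: iterable of ancestor series names (the set A_y).
    y_hat: value of y at time t predicted by the Granger model.
    y_tilde: value of y at time t predicted by the univariate model.
    corrected_prediction: callable giving the scenario-analysis prediction of y
        at time t when the named ancestor is set to its expected values.
    """
    root_causes, obj_max = set(), 0.0
    baseline = abs(y_hat - y_tilde)
    for x in candidates:
        obj = baseline - abs(corrected_prediction(x) - y_tilde)
        if obj > obj_max:
            root_causes, obj_max = {x}, obj   # assumption: a strict improvement restarts A
        elif obj == obj_max:
            root_causes.add(x)                # tie with the current maximum
    return root_causes, obj_max

# Invented scenario-analysis results for three ancestors of y.
predictions = {"b": 11.0, "c": 14.5, "g": 11.0}
causes, best = orca(predictions, y_hat=15.0, y_tilde=10.0,
                    corrected_prediction=predictions.get)
```

Candidates with a negative objective value violate the constraint in Equation (28) and are never stored, because obj_max starts at 0.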
TREE Algorithms


The TREE procedure creates a tree-based classification model using the CART, CHAID, or QUEST algorithm.


CART Algorithms 


The CART algorithm is based on Classification and Regression Trees by Breiman et al. (1984). A
CART tree is a binary decision tree that is constructed by splitting a node into two child nodes 
repeatedly, beginning with the root node that contains the whole learning sample. 


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


Y                           The dependent, or target, variable. It can be ordinal categorical, nominal
                            categorical or continuous. If Y is categorical with J classes, its class takes
                            values in C = {1, ..., J}.

X_m, m=1, ..., M            The set of all predictor variables. A predictor can be ordinal categorical,
                            nominal categorical or continuous.

ħ = {x_n, y_n}, n=1, ..., N The whole learning sample.

ħ(t)                        The learning samples that fall in node t.

w_n                         The case weight associated with case n.

f_n                         The frequency weight associated with case n. A non-integral positive value is
                            rounded to its nearest integer.

π(j), j=1, ..., J           Prior probability of Y = j, j = 1, ..., J.

p(j,t), j=1, ..., J         The probability of a case in class j and node t.

p(t)                        The probability of a case in node t.

p(j|t), j=1, ..., J         The probability of a case in class j given that it falls into node t.

C(i|j)                      The cost of misclassifying a class j case as a class i case. C(j|j) = 0.


Tree Growing Process


The basic idea of tree growing is to choose a split among all the possible splits at each node so that
the resulting child nodes are the “purest”. In this algorithm, only univariate splits are considered.
That is, each split depends on the value of only one predictor variable. All possible splits consist
of possible splits of each predictor. If X is a nominal categorical variable of I categories, there are
2^(I−1) − 1 possible splits for this predictor. If X is an ordinal categorical or continuous variable with
K different values, there are K−1 different splits on X. A tree is grown starting from the root node
by repeatedly using the following steps on each node.


1. Find each predictor’s best split. 


For each continuous and ordinal predictor, sort its values from the smallest to the largest. For the
sorted predictor, go through each value from the top to examine each candidate split point (call it v;
if x<v, the case goes to the left child node, otherwise it goes to the right) to determine the best.
The best split point is the one that maximizes the splitting criterion when the node is split
according to it. The splitting criterion is defined in a later section.


For each nominal predictor, examine each possible subset of categories (call it A, if x € A, the 
case goes to the left child node, otherwise, it goes to the right) to find the best split. 


2. Find the node’s best split. 
Among the best splits found in step 1, choose the one that maximizes the splitting criterion. 


3. Split the node using its best split found in step 2 if the stopping rules are not satisfied. 


Splitting Criteria and Impurity Measures 


At node t, the best split s is chosen to maximize a splitting criterion Δi(s,t). When the impurity
measure for a node can be defined, the splitting criterion corresponds to a decrease in impurity.
ΔI(s,t) = p(t)Δi(s,t) is referred to as the improvement.


Categorical Dependent Variable 


If Y is categorical, there are three splitting criteria available: Gini, Twoing, and ordered Twoing 
criteria. At node t, let probabilities p(j,t), p(t) and p(j|t) be estimated by 


\[ p(j,t) = \frac{\pi(j)\, N_{w,j}(t)}{N_{w,j}} \]

\[ p(t) = \sum_j p(j,t) \]

\[ p(j|t) = \frac{p(j,t)}{p(t)} = \frac{p(j,t)}{\sum_j p(j,t)} \]

where

\[ N_{w,j} = \sum_{n \in \hbar} w_n f_n I(y_n = j) \]

\[ N_{w,j}(t) = \sum_{n \in \hbar(t)} w_n f_n I(y_n = j) \]

with I(a=b) being the indicator function taking value 1 when a=b, 0 otherwise.


Gini Criterion


The Gini impurity measure at a node t is defined as

\[ i(t) = \sum_{i,j} C(i|j)\, p(i|t)\, p(j|t) \]

The Gini splitting criterion is the decrease of impurity defined as

\[ \Delta i(s,t) = i(t) - p_L\, i(t_L) - p_R\, i(t_R) \]

where p_L and p_R are the probabilities of sending a case to the left child node t_L and to the right child
node t_R, respectively. They are estimated as p_L = p(t_L)/p(t) and p_R = p(t_R)/p(t).
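
A minimal Python sketch of the Gini decrease follows. It assumes that the class probabilities p(j|t) of the parent and child nodes have already been estimated, that every case goes to exactly one child (so p_R = 1 − p_L), and that misclassification costs default to 1 for i ≠ j. The function names and the example node are hypothetical.

```python
def gini(class_probs, cost=None):
    """Gini impurity i(t): sum over i, j of C(i|j) p(i|t) p(j|t)."""
    classes = list(class_probs)
    total = 0.0
    for i in classes:
        for j in classes:
            if cost is None:
                c = 0.0 if i == j else 1.0    # assumed unit costs, with C(j|j) = 0
            else:
                c = cost[(i, j)]
            total += c * class_probs[i] * class_probs[j]
    return total

def gini_decrease(parent, left, right, p_left):
    """Decrease in impurity for a split sending a fraction p_left of cases left."""
    return gini(parent) - p_left * gini(left) - (1.0 - p_left) * gini(right)

# Invented node: a balanced two-class parent split into two pure children.
parent = {"yes": 0.5, "no": 0.5}
left = {"yes": 1.0, "no": 0.0}
right = {"yes": 0.0, "no": 1.0}
decrease = gini_decrease(parent, left, right, p_left=0.5)
```

Multiplying the returned decrease by p(t) gives the improvement ΔI(s,t) defined earlier.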


Note: When user-specified costs are involved, the altered priors can optionally be used to replace
the priors. When altered priors are used, the problem is considered as if no costs are involved. The
altered prior is defined as

\[ \pi'(j) = \frac{C(j)\,\pi(j)}{\sum_j C(j)\,\pi(j)}, \quad \text{where } C(j) = \sum_i C(i|j). \]


Note: When the Gini index is used to find the improvement for a split during tree growth, only 
those records in node t and the root node with valid values for the split-predictor are used to 
compute Nj(t) and Nj, respectively. 


Twoing Criterion 


\[ \Delta i(s,t) = p_L\, p_R \left[ \sum_j \left| p(j|t_L) - p(j|t_R) \right| \right]^2 \]


Ordered Twoing Criterion


Ordered Twoing is used only when Y is ordinal categorical. Its algorithm is as follows: 


1. First separate the class C = {1, ..., J} of Y as two super-classes C_1 and C_2 = C − C_1 such that C_1 is
of the form C_1 = {1, ..., j_1}, j_1 = 1, ..., J−1.


2. Using the 2-class measure i(t) = p(C_1|t) p(C_2|t), find the split s*(C_1) that maximizes

\[ \Delta i(s,t) = i(t) - p_L\, i(t_L) - p_R\, i(t_R) = p_L\, p_R \left[ \sum_{j \in C_1} \left\{ p(j|t_L) - p(j|t_R) \right\} \right]^2 \]


3. Find the super-class C*_1 of C_1 which maximizes Δi(s*(C_1), t).


Continuous Dependent Variable


When Y is continuous, the splitting criterion Δi(s,t) = i(t) − p_L i(t_L) − p_R i(t_R) is used with the
Least Squares Deviation (LSD) impurity measure

\[ i(t) = \frac{\sum_{n \in \hbar(t)} w_n f_n \left( y_n - \bar{y}(t) \right)^2}{\sum_{n \in \hbar(t)} w_n f_n} \]

where

\[ p_L = N_w(t_L)/N_w(t), \quad p_R = N_w(t_R)/N_w(t), \quad N_w(t) = \sum_{n \in \hbar(t)} w_n f_n \]

\[ \bar{y}(t) = \frac{\sum_{n \in \hbar(t)} w_n f_n y_n}{N_w(t)} \]

Stopping Rules 


Stopping rules control whether the tree growing process should be stopped. The following
stopping rules are used:


■ If a node becomes pure; that is, all cases in a node have identical values of the dependent
variable, the node will not be split.


■ If all cases in a node have identical values for each predictor, the node will not be split.


■ If the current tree depth reaches the user-specified maximum tree depth limit value, the tree
growing process will stop.


■ If the size of a node is less than the user-specified minimum node size value, the node will
not be split.


■ If the split of a node results in a child node whose node size is less than the user-specified
minimum child node size value, the node will not be split.


■ If for the best split s* of node t, the improvement ΔI(s*,t) = p(t)Δi(s*,t) is smaller than
the user-specified minimum improvement, the node will not be split.


Surrogate Splits 


Given a split X* < s*, its surrogate split is a split using another predictor variable X, X < s_X (or
X > s_X), such that this split is most similar to it and has a positive predictive measure of
association. There may be multiple surrogate splits. The larger the predictive measure of
association, the better the surrogate split.


Predictive measure of association 


Let ħ_{X*∩X} (resp. ħ_{X*∩X}(t)) be the set of learning cases (resp. learning cases in node t) that have
non-missing values of both X* and X. Let p(s* ≈ s_X|t) be the probability of sending a case in
ħ_{X*∩X}(t) to the same child by both s* and s_X, and let s̃_X be the split with maximized probability
p(s* ≈ s̃_X|t) = max_{s_X} p(s* ≈ s_X|t).


The predictive measure of association λ(s* ≈ s̃_X|t) between s* and s̃_X at node t is

\[ \lambda(s^* \approx \tilde{s}_X | t) = \frac{\min(p_L, p_R) - \left( 1 - p(s^* \approx \tilde{s}_X | t) \right)}{\min(p_L, p_R)} \]

where p_L (resp. p_R) is the relative probability that the best split s* at node t sends a case with
non-missing value of X* to the left (resp. right) child node, and where

\[ p(s^* \approx s_X | t) = \begin{cases} \ldots & \text{if } Y \text{ is categorical} \\ \dfrac{N_w(s^* \approx s_X, t)}{N_w(X^* \cap X)} & \text{if } Y \text{ is continuous} \end{cases} \]

with

\[ N_w(X^* \cap X) = \sum_{n \in \hbar_{X^* \cap X}} w_n f_n, \quad N_w(X^* \cap X, t) = \sum_{n \in \hbar_{X^* \cap X}(t)} w_n f_n \]

\[ N_w(s^* \approx s_X, t) = \sum_{n \in \hbar_{X^* \cap X}(t)} w_n f_n I(n : s^* \approx s_X) \]

\[ N_{w,j}(X^* \cap X) = \sum_{n \in \hbar_{X^* \cap X}} w_n f_n I(y_n = j), \quad N_{w,j}(X^* \cap X, t) = \sum_{n \in \hbar_{X^* \cap X}(t)} w_n f_n I(y_n = j) \]

\[ N_{w,j}(s^* \approx s_X, t) = \sum_{n \in \hbar_{X^* \cap X}(t)} w_n f_n I(y_n = j)\, I(n : s^* \approx s_X) \]

and I(n : s* ≈ s_X) being the indicator function taking value 1 when both splits s* and s_X send
the case n to the same child, 0 otherwise.


Missing Value Handling 


If the dependent variable of a case is missing, this case will be ignored in the analysis. If all 
predictor variables of a case are missing, this case will also be ignored. If the case weight is 
missing, zero, or negative, the case is ignored. If the frequency weight is missing, zero, or 
negative, the case is ignored. 


The surrogate split method is otherwise used to deal with missing data in predictor variables.
Suppose that X* < s* is the best split at a node. If the value of X* is missing for a case, the best
surrogate split (among all non-missing predictors associated with surrogate splits) will be used
to decide which child node it should go to. If there are no surrogate splits or all the predictors
associated with surrogate splits for a case are missing, the majority rule is used.


Variable Importance 


The Measure of Importance M(X) of a predictor variable X in relation to the final tree T is defined 
as the (weighted) sum across all splits in the tree of the improvements that X has when it is used 
as a primary or surrogate (but not competitor) splitter. That is, 


\[ M(X) = \sum_{t \in T} \Delta I(\tilde{s}_X, t) \]


If, for a given t, the rank of the surrogate is larger than the maximum number of surrogates to
keep in each node, then the contribution of that split is set to 0.

The Variable Importance VI(X) of X is expressed in terms of a normalized quantity relative to
the variable having the largest measure of importance. It ranges from 0 to 100, with the variable
having the largest measure of importance scored as 100. That is,

\[ VI(X) = 100 \cdot \frac{M(X)}{\max_m M(X_m)} \]
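
As a small illustration of the normalization described above, the following Python sketch rescales a set of importance measures so that the largest becomes 100. The function name, the predictor names, and the values are hypothetical.

```python
def variable_importance(measures):
    """Scale measures of importance M(X) so that the largest one becomes 100."""
    largest = max(measures.values())
    return {name: 100.0 * m / largest for name, m in measures.items()}

# Invented measures of importance for three predictors.
vi = variable_importance({"age": 0.30, "income": 0.12, "region": 0.03})
```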
CHAID and Exhaustive CHAID Algorithms


The CHAID algorithm was originally proposed by Kass (1980), and Exhaustive CHAID by
Biggs et al. (1991). Both CHAID and Exhaustive CHAID allow multiple splits of a node.


Both CHAID and exhaustive CHAID algorithms consist of three steps: merging, splitting and 
stopping. A tree is grown by repeatedly using these three steps on each node starting from the 
root node. 


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


Y                           The dependent variable, or target variable. It can be ordinal categorical,
                            nominal categorical or continuous. If Y is categorical with J classes, its
                            class takes values in C = {1, ..., J}.

X_m, m=1, ..., M            The set of all predictor variables. A predictor can be ordinal categorical,
                            nominal categorical or continuous.

ħ = {x_n, y_n}, n=1, ..., N The whole learning sample.

w_n                         The case weight associated with case n.

f_n                         The frequency weight associated with case n. A non-integral positive value is
                            rounded to its nearest integer.


CHAID Algorithm 


The following algorithm only accepts nominal or ordinal categorical predictors. When predictors
are continuous, they are transformed into ordinal predictors before using the following algorithm.

Binning Continuous Predictors


For a given set of break points a_1, a_2, ..., a_{K−1} (in ascending order), a given x is mapped into
category C(x) as follows:

\[ C(x) = \begin{cases} 1 & x \le a_1 \\ k+1 & a_k < x \le a_{k+1},\ k = 1, \ldots, K-2 \\ K & a_{K-1} < x \end{cases} \]


If K is the desired number of bins, the break points are computed as follows:


Calculate the rank of x_i. Frequency weights are incorporated when calculating the ranks. If there
are ties, the average rank is used. Denote the rank and the corresponding values in ascending
order as {r_(i), x_(i)}, i = 1, ..., n.


For k = 0 to (K − 1), set I_k = {i : ⌊r_(i) K / (N_f + 1)⌋ = k}, where ⌊x⌋ denotes the floor integer of x. If I_k is
not empty, i_k = max{i : i ∈ I_k}. The break points are set equal to the x values corresponding to the
i_k, excluding the largest.


Merging 


For each predictor variable X, merge non-significant categories. Each final category of X will 
result in one child node if X is used to split the node. The merging step also calculates the adjusted 
p-value that is to be used in the splitting step. 


1. If X has 1 category only, stop and set the adjusted p-value to be 1.

2. If X has 2 categories, go to step 8.


3. Else, find the allowable pair of categories of X (an allowable pair of categories for an ordinal
predictor is two adjacent categories, and for a nominal predictor is any two categories) that is least
significantly different (i.e., most similar). The most similar pair is the pair whose test statistic
gives the largest p-value with respect to the dependent variable Y. How to calculate the p-value under
various situations will be described in later sections.


4. For the pair having the largest p-value, check if its p-value is larger than a user-specified
alpha-level α_merge. If it is, this pair is merged into a single compound category. Then a new
set of categories of X is formed. If it is not, then go to step 7.


5. (Optional) If the newly formed compound category consists of three or more original categories,
then find the best binary split within the compound category, that is, the one whose p-value is the smallest.
Perform this binary split if its p-value is not larger than an alpha-level α_split-merge.


6. Go to step 2.


7. (Optional) Any category having too few observations (as compared with a user-specified
minimum segment size) is merged with the most similar other category as measured by the largest
of the p-values.


8. The adjusted p-value is computed for the merged categories by applying Bonferroni adjustments
that are to be discussed later.
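
The core of the merging loop (steps 2 to 4 and 6) can be sketched in Python as follows. The sketch deliberately omits the optional steps 5 and 7 and the Bonferroni adjustment of step 8. The function name `merge_categories` is hypothetical, and the p-value of a pair is supplied by the caller; the `toy_p_value` function below is an invented stand-in used only to make the example runnable, not one of the tests described in this chapter.

```python
import math
from itertools import combinations

def merge_categories(categories, pair_p_value, alpha_merge, ordinal=True):
    """Merge the most similar allowable pair while its p-value exceeds alpha_merge."""
    groups = [(c,) for c in categories]
    while len(groups) > 2:
        if ordinal:
            pairs = [(i, i + 1) for i in range(len(groups) - 1)]    # adjacent pairs only
        else:
            pairs = list(combinations(range(len(groups)), 2))       # any two categories
        scored = [(pair_p_value(groups[i], groups[j]), i, j) for i, j in pairs]
        p_best, i, j = max(scored)
        if p_best <= alpha_merge:
            break                     # the most similar pair is still significantly different
        merged = groups[i] + groups[j]
        groups = [g for k, g in enumerate(groups) if k not in (i, j)]
        groups.insert(i, merged)
    return groups

# Invented stand-in for a real test: similar group means give a large "p-value".
means = {"low": 1.0, "mid": 1.1, "high": 3.0, "top": 3.2}

def toy_p_value(g1, g2):
    m1 = sum(means[c] for c in g1) / len(g1)
    m2 = sum(means[c] for c in g2) / len(g2)
    return math.exp(-abs(m1 - m2))

result = merge_categories(["low", "mid", "high", "top"], toy_p_value, alpha_merge=0.5)
```

In a real application, `pair_p_value` would run the F test or the chi-squared test on the data belonging to the two groups, depending on the type of the dependent variable.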
Splitting


The “best” split for each predictor is found in the merging step. The splitting step selects which
predictor is to be used to best split the node. Selection is accomplished by comparing the adjusted
p-value associated with each predictor. The adjusted p-value is obtained in the merging step.


1. Select the predictor that has the smallest adjusted p-value (i.e., most significant).

2. If this adjusted p-value is less than or equal to a user-specified alpha-level α_split, split the node
using this predictor. Else, do not split and the node is considered as a terminal node.

Stopping 

The stopping step checks whether the tree growing process should be stopped according to the following
stopping rules.

1. If a node becomes pure; that is, all cases in a node have identical values of the dependent variable,
the node will not be split.

2. If all cases in a node have identical values for each predictor, the node will not be split.

3. If the current tree depth reaches the user-specified maximum tree depth limit value, the tree
growing process will stop.

4. If the size of a node is less than the user-specified minimum node size value, the node will not be
split.

5. If the split of a node results in a child node whose node size is less than the user-specified
minimum child node size value, child nodes that have too few cases (as compared with this
minimum) will merge with the most similar child node as measured by the largest of the p-values.
However, if the resulting number of child nodes is 1, the node will not be split.


Exhaustive CHAID Algorithm 


The splitting and stopping steps in the Exhaustive CHAID algorithm are the same as those in CHAID.
The merging step uses an exhaustive search procedure to merge any similar pair until only a single
pair remains.


As with CHAID, only nominal or ordinal categorical predictors are allowed; continuous
predictors are first transformed into ordinal predictors before using the following algorithm.


Merging 


1. If X has 1 category only, then set the adjusted p-value to be 1.


2. Set index = 0. Calculate the p-value based on the set of categories of X at this time. Call the
p-value p(index) = p(0).


3. Else, find the allowable pair of categories of X that is least significantly different; that is, most
similar. This can be determined by the pair whose test statistic gives the largest p-value with
respect to the dependent variable Y. How to calculate the p-value under various situations will be
described in a later section.


4. Merge the pair that gives the largest p-value into a compound category. 


5. (Optional) If the compound category just formed contains three or more original categories, 
search for a binary split of this compound category that gives the smallest p-value. If this p-value 
is larger than the one in forming the compound category by merging in the previous step, perform 
the binary split on that compound category. 


6. Update the index = index + 1, calculate the p-value based on the set of categories of X at this 
time. Denote p(index) as the p-value. 


7. Repeat 3 to 6 until only two categories remain. Then among all the indices, find the set of 
categories such that p(index) is the smallest. 


8. (Optional) Any category having too few observations (as compared with a user-specified minimum
segment size) is merged with the most similar other category as measured by the largest p-value.


9. The adjusted p-value is computed by applying Bonferroni adjustments which are to be discussed
in a later section.


Unlike the CHAID algorithm, no user-specified alpha-level is needed in the merging step. Only the
alpha-level α_split is needed, in the splitting step.


p-Value Calculations 


Calculations of (unadjusted) p-values in the above algorithms depend on the type of dependent 
variable. 


The merging step of both CHAID and Exhaustive CHAID sometimes needs the p-value for a pair 
of X categories, and sometimes needs the p-value for all the categories of X. When the p-value for 
a pair of X categories is needed, only part of data in the current node is relevant. Let D denote the 
relevant data. Suppose in D there are I categories of X, and J categories of Y (if Y is categorical). 
The p-value calculation using data in D is given below. 


Scale Dependent Variable 


If the dependent variable Y is scale, perform an ANOVA F test that tests if the means of Y for
different categories of X are the same. This ANOVA F test calculates the F-statistic and hence
derives the p-value as

\[ F = \frac{\sum_{i=1}^{I} \sum_{n \in D} w_n f_n I(x_n = i)\, (\bar{y}_i - \bar{y})^2 / (I - 1)}{\sum_{i=1}^{I} \sum_{n \in D} w_n f_n I(x_n = i)\, (y_n - \bar{y}_i)^2 / (N_f - I)} \]

\[ p = \Pr\left( F(I-1, N_f - I) > F \right) \]

where

\[ \bar{y}_i = \frac{\sum_{n \in D} w_n f_n y_n I(x_n = i)}{\sum_{n \in D} w_n f_n I(x_n = i)}, \quad \bar{y} = \frac{\sum_{n \in D} w_n f_n y_n}{\sum_{n \in D} w_n f_n}, \quad N_f = \sum_{n \in D} f_n \]

and F(I − 1, N_f − I) is a random variable following an F-distribution with degrees of freedom
I − 1 and N_f − I.
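
The F-statistic can be computed with a short Python sketch. The function name `anova_f` and the data are hypothetical, and the sketch stops at the statistic and its degrees of freedom: obtaining the p-value requires the F-distribution function, which is assumed to come from a statistics library and is not shown.

```python
def anova_f(x, y, w, f):
    """Weighted one-way ANOVA F statistic with degrees of freedom (I - 1, N_f - I)."""
    cats = sorted(set(x))
    wf = [wn * fn for wn, fn in zip(w, f)]          # combined weights w_n * f_n
    n_f = sum(f)                                    # N_f, the sum of frequency weights
    grand_mean = sum(c * yn for c, yn in zip(wf, y)) / sum(wf)
    between = within = 0.0
    for cat in cats:
        rows = [(c, yn) for c, yn, xn in zip(wf, y, x) if xn == cat]
        weight = sum(c for c, _ in rows)
        mean = sum(c * yn for c, yn in rows) / weight
        between += weight * (mean - grand_mean) ** 2
        within += sum(c * (yn - mean) ** 2 for c, yn in rows)
    df1, df2 = len(cats) - 1, n_f - len(cats)
    return (between / df1) / (within / df2), df1, df2

# Invented data: two categories of X with clearly different means of Y.
x = ["a", "a", "a", "b", "b", "b"]
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
f_stat, df1, df2 = anova_f(x, y, w=[1.0] * 6, f=[1] * 6)
```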


Nominal Dependent Variable 


If the dependent variable Y is nominal categorical, the null hypothesis of independence of X and 
Y is tested. To perform the test, a contingency (or count) table is formed using classes of Y as 
columns and categories of the predictor X as rows. The expected cell frequencies under the null 
hypothesis are estimated. The observed cell frequencies and the expected cell frequencies are 
used to calculate the Pearson chi-squared statistic or likelihood ratio statistic. The p-value is 
computed based on either one of these two statistics. 


The Pearson’s Chi-square statistic and likelihood ratio statistic are, respectively,

\[ X^2 = \sum_{j=1}^{J} \sum_{i=1}^{I} \frac{(n_{ij} - \hat{m}_{ij})^2}{\hat{m}_{ij}} \]

\[ G^2 = 2 \sum_{j=1}^{J} \sum_{i=1}^{I} n_{ij} \ln\left( n_{ij} / \hat{m}_{ij} \right) \]

where n_ij = Σ_{n∈D} f_n I(x_n = i ∧ y_n = j) is the observed cell frequency and m̂_ij is the estimated
expected cell frequency for cell (x_n = i, y_n = j) following the independence model. The
corresponding p-value is given by p = Pr(χ²_d > X²) for Pearson’s Chi-square test or
p = Pr(χ²_d > G²) for the likelihood ratio test, where χ²_d follows a chi-squared distribution with
degrees of freedom d = (J−1)(I−1).


Estimation of Expected Cell Frequencies without Case Weights

\[ \hat{m}_{ij} = \frac{n_{i.}\, n_{.j}}{n_{..}} \]

where

\[ n_{i.} = \sum_{j=1}^{J} n_{ij}, \quad n_{.j} = \sum_{i=1}^{I} n_{ij}, \quad n_{..} = \sum_{j=1}^{J} \sum_{i=1}^{I} n_{ij} \]
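
The two statistics and the expected cell frequencies without case weights can be sketched in Python as follows. The function name and the table are hypothetical. Two assumptions are made explicit: cells with n_ij = 0 contribute 0 to the likelihood ratio statistic, and all row and column totals are positive. The p-value is not computed, since it requires the chi-squared distribution function with d degrees of freedom from a statistics library.

```python
import math

def independence_statistics(table):
    """Return (X2, G2, d) for an I x J table of observed cell frequencies n_ij."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    x2 = g2 = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            m_ij = row_totals[i] * col_totals[j] / total    # expected under independence
            x2 += (n_ij - m_ij) ** 2 / m_ij
            if n_ij > 0:                                    # assumed convention: 0 * ln(0) = 0
                g2 += 2.0 * n_ij * math.log(n_ij / m_ij)
    d = (len(table) - 1) * (len(col_totals) - 1)
    return x2, g2, d

# Invented 2 x 2 table: rows are categories of X, columns are classes of Y.
x2, g2, d = independence_statistics([[10, 20], [20, 10]])
```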


Estimation of Expected Cell Frequencies with Case Weights 


If case weights are specified, the expected cell frequency under the null hypothesis of 
independence is of the form 


\[ m_{ij} = \bar{w}_{ij}^{-1} \alpha_i \beta_j \]


where α_i and β_j are parameters to be estimated, and
Parameter estimates α̂_i, β̂_j, and hence m̂_ij, result from the following iterative procedure.
\[ m_{ij}^{(k+1)} = \bar{w}_{ij}^{-1} \alpha_i^{(k+1)} \beta_j^{(k+1)} \]
3. If max_{i,j} |m_ij^(k+1) − m_ij^(k)| < ε, stop and output α_i^(k+1), β_j^(k+1) and m_ij^(k+1) as the final estimates.


Otherwise, k = k + 1, go to step 2.


Ordinal Dependent Variable 


If the dependent variable Y is categorical ordinal, the null hypothesis of independence of X and Y
is tested against the row effects model, with the rows being the categories of X and columns the
classes of Y, proposed by Goodman (1979). Two sets of expected cell frequencies, m̂_ij (under
the hypothesis of independence) and m̂̂_ij (under the hypothesis that the data follow a row effects
model), are both estimated. The likelihood ratio statistic and the p-value are

\[ H^2 = 2 \sum_{i=1}^{I} \sum_{j=1}^{J} \hat{\hat{m}}_{ij} \ln\left( \hat{\hat{m}}_{ij} / \hat{m}_{ij} \right) \]

\[ p = \Pr\left( \chi^2_{I-1} > H^2 \right) \]


Estimation of Expected Cell Frequencies under Row Effects Model 


In the row effects model, scores for classes of Y are needed. By default, the order of a class of Y is
used as the class score. Users can specify their own set of scores. Scores are set at the beginning
of the tree and kept unchanged afterward. Let s_j be the score for class j of Y, j = 1, ..., J. The
expected cell frequency under the row effects model is given by

\[ m_{ij} = \bar{w}_{ij}^{-1} \alpha_i \beta_j \gamma_i^{(s_j - \bar{s})} \]


where
in which w_.j = Σ_i w_ij, and α_i, β_j and γ_i are unknown parameters to be estimated. Parameter
estimates α̂_i, β̂_j, γ̂_i, and hence m̂̂_ij, result from the following iterative procedure.
If max_{i,j} |m_ij^(k+1) − m_ij^(k)| < ε, stop and output α_i^(k+1), β_j^(k+1), γ_i^(k+1) and m_ij^(k+1) as the final
estimates. Otherwise, k = k + 1, go to step 2.


Bonferroni Adjustments 


The adjusted p-value is calculated as the p-value times a Bonferroni multiplier. The Bonferroni 
multiplier adjusts for multiple tests. 


CHAID 


Suppose that a predictor variable originally has I categories, and it is reduced to r categories after
the merging step. The Bonferroni multiplier B is the number of possible ways that I categories can
be merged into r categories. For r = I, B = 1. For 2 ≤ r < I, use the following equation.

\[ B = \begin{cases} \dbinom{I-1}{r-1} & \text{Ordinal predictor} \\[2ex] \displaystyle\sum_{v=0}^{r-1} (-1)^v \frac{(r-v)^I}{v!\,(r-v)!} & \text{Nominal predictor} \\[2ex] \dbinom{I-2}{r-2} + r \dbinom{I-2}{r-1} & \text{Ordinal with a missing category} \end{cases} \]


Exhaustive CHAID 


Exhaustive CHAID merges two categories iteratively until only two categories are left. The
Bonferroni multiplier B is the sum of the number of possible ways of merging two categories at
each iteration.

\[ B = \begin{cases} \dfrac{I(I-1)}{2} & \text{Ordinal predictor} \\[2ex] \dfrac{I(I^2-1)}{6} & \text{Nominal predictor} \\[2ex] \dfrac{I(I-1)}{2} & \text{Ordinal with a missing category} \end{cases} \]


Missing Values 


If the dependent variable of a case is missing, it will not be used in the analysis. If all predictor 
variables of a case are missing, this case is ignored. If the case weight is missing, zero, or negative, 
the case is ignored. If the frequency weight is missing, zero, or negative, the case is ignored. 


Otherwise, missing values will be treated as a predictor category. For ordinal predictors, the 
algorithm first generates the “best” set of categories using all non-missing information from the 
data. Next the algorithm identifies the category that is most similar to the missing category. 
Finally, the algorithm decides whether to merge the missing category with its most similar 
category or to keep the missing category as a separate category. Two p-values are calculated, one 
for the set of categories formed by merging the missing category with its most similar category, 
and the other for the set of categories formed by adding the missing category as a separate 
category. Take the action that gives the smallest p-value. 


For nominal predictors, the missing category is treated the same as other categories in the analysis.


QUEST Algorithms


QUEST was proposed by Loh and Shih (1997) as a Quick, Unbiased, Efficient, Statistical Tree. It is
a tree-structured classification algorithm that yields a binary decision tree. A comparison study of
QUEST and other algorithms was conducted by Lim et al. (2000).


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


Y                                    The dependent, or target, variable. It must be nominal categorical. If Y is
                                     categorical with J classes, its class takes values in C = {1, ..., J}.
$X_m$, m = 1, ..., M                 The set of all predictor variables. A predictor can be nominal categorical or
                                     continuous (including ordinal categorical).
$\hbar = \{x_n, y_n\}_{n=1}^{N}$     The whole learning sample.
$\hbar(t)$                           The learning samples that fall in node t.
$f_n$                                The frequency weight associated with case n. Non-integral positive value is
                                     rounded to its nearest integer.
$N_f$                                Total number of learning cases, $N_f = \sum_{n \in \hbar} f_n$.
$N_{f,j}$                            Total number of class j learning cases, $N_{f,j} = \sum_{n \in \hbar} f_n I(y_n = j)$.
$N_f(t)$                             Total number of learning cases in node t, $N_f(t) = \sum_{n \in \hbar(t)} f_n$.
$N_{f,j}(t)$                         Total number of class j learning cases in node t,
                                     $N_{f,j}(t) = \sum_{n \in \hbar(t)} f_n I(y_n = j)$.
$\pi(j)$, j = 1, ..., J              Prior probability of Y = j, j = 1, ..., J.
$p(j,t)$, j = 1, ..., J              The probability of a case in class j and node t.
$p(t)$                               The probability of a case in node t.
$p(j|t)$, j = 1, ..., J              The probability of a case in class j given that it falls into node t.
$C(i|j)$                             The cost of misclassifying a class j case as a class i case. $C(j|j) = 0$.


Tree Growing Process 


The QUEST tree growing process consists of the selection of a split predictor, selection of a 
split point for the selected predictor, and stopping. In this algorithm, only univariate splits are 
considered. 


Selection of Split Predictor 


1. For each continuous predictor X, perform an ANOVA F test that tests if all the different classes of
the dependent variable Y have the same mean of X, and calculate the p-value according to the
F statistic. For each categorical predictor, perform a Pearson’s chi-square test of the independence
of Y and X, and calculate the p-value according to the chi-square statistic.


2. Find the predictor with the smallest p-value and denote it X*.


3. If this smallest p-value is less than $\alpha / M$, where $\alpha \in (0,1)$ is a user-specified level of significance
and M is the total number of predictor variables, predictor X* is selected as the split predictor
for the node. If not, go to step 4.


4. For each continuous predictor X, compute a Levene’s F statistic based on the absolute deviation
of X from its class mean to test if the variances of X for different classes of Y are the same, and
calculate the p-value for the test.


5. Find the predictor with the smallest p-value and denote it as X**.


6. If this smallest p-value is less than $\alpha / (M + M_1)$, where $M_1$ is the number of continuous predictors,
X** is selected as the split predictor for the node. Otherwise, this node is not split.
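
The two-stage selection above can be sketched as follows. This is an illustrative outline only: the function and argument names are hypothetical, and the p-values are assumed to have been computed elsewhere (ANOVA F or chi-square for every predictor, Levene's F for the continuous predictors only).

```python
def select_split_predictor(p_assoc, p_levene, alpha=0.05):
    """Return the name of the split predictor, or None if the node is not split.

    p_assoc  : {predictor: p-value} from the ANOVA F / chi-square tests (all M predictors)
    p_levene : {predictor: p-value} from Levene's F test (the M1 continuous predictors)
    """
    M = len(p_assoc)
    M1 = len(p_levene)

    # Steps 2-3: smallest p-value against the threshold alpha / M.
    best = min(p_assoc, key=p_assoc.get)
    if p_assoc[best] < alpha / M:
        return best

    # Steps 4-6: fall back to Levene's test with the threshold alpha / (M + M1).
    if M1 > 0:
        best_levene = min(p_levene, key=p_levene.get)
        if p_levene[best_levene] < alpha / (M + M1):
            return best_levene

    return None  # no predictor is significant: the node is not split
```

The second threshold is stricter than the first because it accounts for the additional $M_1$ Levene tests.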


ANOVA F Test 


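Suppose, for node t, there are $J_t$ classes of dependent variable Y. The F statistic for a continuous
predictor X is given by

$$F_X = \frac{\sum_{j=1}^{J_t} N_{f,j}(t) \left( \bar{x}^{(j)}(t) - \bar{x}(t) \right)^2 / (J_t - 1)}{\sum_{n \in \hbar(t)} f_n \left( x_n - \bar{x}^{(y_n)}(t) \right)^2 / (N_f(t) - J_t)}$$

where

$$\bar{x}^{(j)}(t) = \frac{\sum_{n \in \hbar(t)} f_n x_n I(y_n = j)}{N_{f,j}(t)}, \qquad \bar{x}(t) = \frac{\sum_{n \in \hbar(t)} f_n x_n}{N_f(t)}$$

Its corresponding p-value is given by

$$p_X = \Pr\left( F(J_t - 1, N_f(t) - J_t) > F_X \right)$$

where $F(J_t - 1, N_f(t) - J_t)$ follows an F distribution with $J_t - 1$ and $N_f(t) - J_t$ degrees of
freedom.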
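
As an illustration, the frequency-weighted F statistic can be computed directly from its definition. The sketch below assumes the cases in the node are given as three parallel lists (hypothetical names `x`, `y`, `f`) and that the within-class sum of squares is non-zero; the p-value would then be read from an F distribution with the returned degrees of freedom.

```python
from collections import defaultdict


def anova_f_statistic(x, y, f):
    """Frequency-weighted one-way ANOVA F statistic for predictor x across classes y.

    Returns (F, df_between, df_within).
    """
    n_class = defaultdict(float)    # N_{f,j}(t)
    sum_class = defaultdict(float)  # sum of f_n * x_n within class j
    for xn, yn, fn in zip(x, y, f):
        n_class[yn] += fn
        sum_class[yn] += fn * xn

    n_total = sum(n_class.values())            # N_f(t)
    n_groups = len(n_class)                    # J_t
    grand_mean = sum(sum_class.values()) / n_total
    class_mean = {j: sum_class[j] / n_class[j] for j in n_class}

    between = sum(n_class[j] * (class_mean[j] - grand_mean) ** 2 for j in n_class)
    within = sum(fn * (xn - class_mean[yn]) ** 2 for xn, yn, fn in zip(x, y, f))

    df_between = n_groups - 1
    df_within = n_total - n_groups
    return (between / df_between) / (within / df_within), df_between, df_within
```

Applying the same function to the absolute deviations $z_n$ in place of $x_n$ gives the Levene statistic used in the second stage of predictor selection.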


Pearson’s Chi-Square Test 


Suppose, for node t, there are $J_t$ classes of dependent variable Y. The Pearson’s chi-square
statistic for a categorical predictor X with $I_t$ categories is given by

$$X^2 = \sum_{j=1}^{J_t} \sum_{i=1}^{I_t} \frac{(n_{ij} - \hat{m}_{ij})^2}{\hat{m}_{ij}}$$

where

$$n_{ij} = \sum_{n \in \hbar(t)} f_n I(y_n = j \wedge x_n = i), \qquad \hat{m}_{ij} = \frac{n_{i.} \, n_{.j}}{n_{..}}$$

with

$$n_{i.} = \sum_{j=1}^{J_t} n_{ij}, \qquad n_{.j} = \sum_{i=1}^{I_t} n_{ij}, \qquad n_{..} = \sum_{j=1}^{J_t} \sum_{i=1}^{I_t} n_{ij}$$

where $I(y_n = j \wedge x_n = i) = 1$ if case n has $y_n = j$ and $x_n = i$; 0 otherwise.


The corresponding p-value is given by $p_X = \Pr(\chi^2_d > X^2)$, where $\chi^2_d$ follows a chi-squared
distribution with degrees of freedom $d = (J_t - 1)(I_t - 1)$.
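
A minimal sketch of the statistic, assuming the weighted counts $n_{ij}$ have already been tabulated into a list of rows (one row per predictor category, one column per class) and that no row or column sums to zero. The function name is illustrative.

```python
def pearson_chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a table of counts.

    table[i][j] is the weighted count n_ij for predictor category i and class j.
    """
    row_totals = [sum(row) for row in table]         # n_i.
    col_totals = [sum(col) for col in zip(*table)]   # n_.j
    total = sum(row_totals)                          # n_..

    statistic = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total  # m_ij under independence
            statistic += (n_ij - expected) ** 2 / expected

    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df
```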


Levene’s F Test 


For continuous predictor X, calculate $z_n = \left| x_n - \bar{x}^{(y_n)}(t) \right|$. The Levene’s F statistic for predictor
X is the ANOVA F statistic for $z_n$.


Selection of Split Point 


At a node, suppose that a predictor variable X has been selected for splitting. The next step is to
determine the split point. If X is a continuous predictor variable, a split point d in the split X ≤ d
is to be determined. If X is a nominal categorical predictor variable, a subset K of the set of all
values taken by X in the split X ∈ K is to be determined. The algorithm is as follows.


Continuous Splitting Predictor 


If the selected predictor variable X is continuous: 


1. Group classes of dependent variable Y into two super-classes. If there are only two classes of Y, 
go to step 2. Otherwise, calculate the sample mean of X for each class of Y. If all class means 
are identical, the class with the most cases is gathered as super-class A and the other classes 
as super-class B. If there are two or more classes with the same maximum number of cases, 
the one with the smallest class index j is chosen to form A and the rest to B. If not all the class 
means are identical, a k-means clustering method, with the initial cluster centers set at the two 
most extreme class means, is applied to class means to divide classes of Y into two super-classes: 
A and B. Let $\bar{x}_A$ and $s_A^2$ denote the sample mean and variance for super-class A, and $\bar{x}_B$ and $s_B^2$ the
sample mean and variance for super-class B.


2. If $\min(s_A^2, s_B^2) = 0$: order the two super-classes by their variance in increasing order and denote
the variances by $s_1^2 \le s_2^2$, and the corresponding means by $\bar{x}_1$, $\bar{x}_2$. Let $\epsilon$ be a very small positive
number, say $\epsilon = 10^{-12}$. If $\bar{x}_1 < \bar{x}_2$, $d = \bar{x}_1 (1 + \epsilon)$. Else, $d = \bar{x}_1 (1 - \epsilon)$.


3. If $\min(s_A^2, s_B^2) \ne 0$, quadratic discriminant analysis (QDA) is applied to determine the split
point d. QDA assumes that X follows a normal distribution in each super-class with the
calculated sample mean and variance. The split point is among the roots that make probability
$\Pr(x, A|t) = \Pr(x, B|t)$ for node t, where

$$\Pr(x, A|t) = P(x|A,t) P(A|t) = P(A|t) \frac{1}{\sqrt{2\pi} s_A} \exp\left\{ -\frac{(x - \bar{x}_A)^2}{2 s_A^2} \right\}$$

with

$$P(A|t) = \sum_{j \in A} p(j|t), \qquad p(j|t) = \frac{p(j,t)}{\sum_j p(j,t)}, \qquad p(j,t) = \frac{\pi(j) N_{f,j}(t)}{N_{f,j}}$$

Solving $\Pr(x, A|t) = \Pr(x, B|t)$ is equivalent to solving the following quadratic equation

$$a x^2 + b x + c = 0$$

where

$$a = s_A^2 - s_B^2, \qquad b = 2 (\bar{x}_A s_B^2 - \bar{x}_B s_A^2)$$

$$c = (\bar{x}_B s_A)^2 - (\bar{x}_A s_B)^2 + 2 s_A^2 s_B^2 \log \frac{P(A|t) s_B}{P(B|t) s_A}$$

If there is only one real root, it is chosen to be the split point, provided this yields two non-empty
nodes. If there are two real roots, choose the one that is closer to $\bar{x}_A$, provided this yields two
non-empty nodes. Otherwise use the mean $(\bar{x}_A + \bar{x}_B)/2$ as split point.
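
The quadratic in step 3 can be solved directly. The sketch below is an illustration under stated assumptions: both variances are strictly positive, the super-class probabilities P(A|t) and P(B|t) are supplied by the caller, and the check that the chosen root yields two non-empty nodes is left to the caller. The function and argument names are hypothetical.

```python
import math


def qda_split_point(mean_a, var_a, p_a, mean_b, var_b, p_b):
    """Candidate split point d from a*d^2 + b*d + c = 0 (both variances must be > 0)."""
    s_a, s_b = math.sqrt(var_a), math.sqrt(var_b)
    a = var_a - var_b
    b = 2.0 * (mean_a * var_b - mean_b * var_a)
    c = ((mean_b * s_a) ** 2 - (mean_a * s_b) ** 2
         + 2.0 * var_a * var_b * math.log((p_a * s_b) / (p_b * s_a)))
    midpoint = (mean_a + mean_b) / 2.0

    if a == 0.0:
        # Equal variances: the equation is linear and has at most one root.
        return -c / b if b != 0.0 else midpoint

    discriminant = b * b - 4.0 * a * c
    if discriminant < 0.0:
        return midpoint  # no real root: fall back to the mean of the two means

    root = math.sqrt(discriminant)
    candidates = [(-b - root) / (2.0 * a), (-b + root) / (2.0 * a)]
    # With two real roots, take the one closer to the mean of super-class A.
    return min(candidates, key=lambda d: abs(d - mean_a))
```

With equal variances and equal probabilities the split reduces to the midpoint between the two super-class means, which is a quick sanity check on the coefficients.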


Note: In step 3, the prior probability distribution for the dependent variable is needed. When
user-specified costs are involved, the altered priors can be used to replace the priors (optional). The
altered prior is defined as $\pi'(j) = \frac{C(j) \pi(j)}{\sum_j C(j) \pi(j)}$, where $C(j) = \sum_i C(i|j)$.
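
For illustration, the altered priors can be computed as below. This is a sketch with hypothetical names; classes are indexed from 0, and `cost[i][j]` is assumed to hold $C(i|j)$, the cost of classifying a class j case as class i.

```python
def altered_priors(priors, cost):
    """Altered priors pi'(j) = C(j) pi(j) / sum_j C(j) pi(j), with C(j) = sum_i C(i|j)."""
    n_classes = len(priors)
    # C(j): total cost of misclassifying a class j case, summed over assigned classes i.
    class_cost = [sum(cost[i][j] for i in range(n_classes)) for j in range(n_classes)]
    denominator = sum(class_cost[j] * priors[j] for j in range(n_classes))
    return [class_cost[j] * priors[j] / denominator for j in range(n_classes)]
```

A class that is expensive to misclassify receives a larger altered prior, so the split point shifts to protect it.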


Nominal Splitting Predictor 


If the selected predictor variable X is nominal and with more than two categories (if X is binary,
the split point is clear), QUEST first transforms it into a continuous variable (call it $\xi$) by assigning
the largest discriminant coordinates to categories of the predictor. QUEST then applies the split
point selection algorithm for continuous predictors on $\xi$ to determine the split point.


Transforming a Categorical Predictor into a Continuous Predictor 


Let X be a nominal categorical predictor taking values in the set $\{b_1, ..., b_I\}$. Transform X into a
continuous variable $\xi$ such that the ratio of between-classes to within-classes sum of squares of $\xi$ is
maximized (the classes here refer to the classes of the dependent variable). The details are as follows:

■ Transform each value x of X into an I-dimensional dummy vector $v = (v_1, ..., v_I)'$, where

$$v_i = \begin{cases} 1 & x = b_i \\ 0 & \text{otherwise} \end{cases}$$

■ Calculate the overall and class j mean of v:

$$\bar{v} = \frac{\sum_{n \in \hbar} f_n v_n}{N_f}, \qquad \bar{v}^{(j)} = \frac{\sum_{n \in \hbar} f_n v_n I(y_n = j)}{N_{f,j}}$$

■ Calculate the following I×I matrices:

$$B = \sum_{j=1}^{J} N_{f,j} \left( \bar{v}^{(j)} - \bar{v} \right) \left( \bar{v}^{(j)} - \bar{v} \right)'$$

$$T = \sum_{n \in \hbar} f_n (v_n - \bar{v}) (v_n - \bar{v})'$$

■ Perform singular value decomposition on T to obtain $T = Q D Q'$, where Q is an I×I orthogonal
matrix, $D = diag(d_1, ..., d_I)$ such that $d_1 \ge ... \ge d_I \ge 0$. Let $D^{-1/2} = diag(d_1^*, ..., d_I^*)$ where
$d_i^* = d_i^{-1/2}$ if $d_i > 0$, 0 otherwise. Perform singular value decomposition on $D^{-1/2} Q' B Q D^{-1/2}$ to
obtain its eigenvector a which is associated with its largest eigenvalue.

■ The largest discriminant coordinate of v is the projection

$$\xi = a' D^{-1/2} Q' v$$


Note: The original QUEST by Loh and Shih (1997) transforms a categorical predictor into a 
continuous predictor at a considered node based on the data in the node. This implementation 
of QUEST does the transformation only once at the very beginning based on the whole learning 
sample. 


Stopping 


The stopping step checks if the tree growing process should be stopped according to the following 
stopping rules. 


1. If a node becomes pure; that is, all cases belong to the same dependent variable class at the node,
the node will not be split.


2. If all cases in a node have identical values for each predictor, the node will not be split. 


3. If the current tree depth reaches the user-specified maximum tree depth limit value, the tree 
growing process will stop. 


4. If the size of a node is less than the user-specified minimum node size value, the node will not be 
split. 


5. If the split of a node results in a child node whose node size is less than the user-specified 
minimum child node size value, the node will not be split. 


Missing Values 


If the dependent variable of a case is missing, this case will be ignored in the analysis. If all
predictor variables of a case are missing, this case will be ignored. If the frequency weight is
missing, zero or negative, the case will be ignored.

Otherwise, the surrogate split method will be used to deal with missing data in predictor variables.
If a case has a missing value at the selected predictor, the assignment will be done based on the
surrogate split. The method of defining and calculating surrogate splits is the same as that in
CART. For more information, see the topic “Missing Value Handling”.


Assignment and Risk Estimation Algorithms


This section discusses how a class or a value is assigned to a node and to a case and three methods 
of risk estimation: the resubstitution method, test sample method and cross validation method. 
The information is applicable to the tree growing algorithms CART, CHAID, exhaustive CHAID 
and QUEST. Materials in this document are based on Classification and Regression Trees by 
Breiman, et al (1984). It is assumed that a CART, CHAID, exhaustive CHAID or QUEST tree has 
been grown successfully using a learning sample. 


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


Y The dependent variable, or target variable. It can be either categorical 
(nominal or ordinal) or continuous. If Y is categorical with J classes, its 
class takes values in C = {1, ..., J}. 

$\hbar = \{x_n, y_n\}_{n=1}^{N}$     The learning sample, where $x_n$ and $y_n$ are the predictor vector and dependent
                                     variable for case n.

$\hbar(t)$                           The learning samples that fall in node t.

$f_n$                                The frequency weight associated with case n. Non-integral positive value is
                                     rounded to its nearest integer.

$w_n$                                The case weight associated with case n.

$\pi(j)$, j = 1, ..., J              Prior probability of Y = j.

$C(i|j)$                             The cost of misclassifying a class j case as a class i case, $C(j|j) = 0$.

Assignment 


Once the tree is grown, an assignment (also called action or decision) is given to each node based
on the learning sample. To predict the dependent variable value for an incoming case, we first find
in which terminal node it falls, then use the assignment of that terminal node for prediction.


Assignment of a Node


For any node t, let $d_t$ be the assignment given to node t,

$$d_t = \begin{cases} j^*(t) & \text{Y is categorical} \\ \bar{y}(t) & \text{Y is continuous} \end{cases}$$

$$j^*(t) = \arg\min_i \sum_j C(i|j) \, p(j|t)$$

where

$$p(j|t) = \frac{p(j,t)}{\sum_j p(j,t)}, \qquad p(j,t) = \pi(j) \frac{N_{w,j}(t)}{N_{w,j}}, \qquad \bar{y}(t) = \frac{1}{N_w(t)} \sum_{n \in \hbar(t)} w_n f_n y_n$$

$$N_w = \sum_{n \in \hbar} w_n f_n, \qquad N_{w,j} = \sum_{n \in \hbar} w_n f_n I(y_n = j)$$

$$N_w(t) = \sum_{n \in \hbar(t)} w_n f_n, \qquad N_{w,j}(t) = \sum_{n \in \hbar(t)} w_n f_n I(y_n = j)$$

If there is more than one class j that achieves the minimum, choose $j^*(t)$ to be the smallest such j
for which $N_{f,j}(t) = \sum_{n \in \hbar(t)} f_n I(y_n = j)$ is greater than 0, or the absolute smallest if $N_{f,j}(t)$ is zero
for all of them. For CHAID and exhaustive CHAID, use $\pi(j) = N_{w,j} / N_w$ in the equation.
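
The class assignment of a node, including the tie-breaking rule, might be sketched as follows. The names are hypothetical, classes are indexed from 0, `cost[i][j]` is assumed to hold $C(i|j)$, and ties are detected with a small numerical tolerance, which is an implementation choice rather than part of the definition.

```python
import math


def assign_class(prior, n_w_node, n_w_total, cost, n_f_node):
    """Return j*(t), the class minimizing the expected misclassification cost in node t.

    prior[j]     : pi(j)
    n_w_node[j]  : N_{w,j}(t), weighted class counts in the node
    n_w_total[j] : N_{w,j}, weighted class counts in the learning sample
    n_f_node[j]  : N_{f,j}(t), frequency-weighted class counts in the node (tie-breaking)
    """
    n_classes = len(prior)
    p_joint = [prior[j] * n_w_node[j] / n_w_total[j] for j in range(n_classes)]  # p(j,t)
    p_node = sum(p_joint)
    p_cond = [p / p_node for p in p_joint]                                       # p(j|t)

    expected_cost = [
        sum(cost[i][j] * p_cond[j] for j in range(n_classes))
        for i in range(n_classes)
    ]
    best = min(expected_cost)
    tied = [i for i in range(n_classes)
            if math.isclose(expected_cost[i], best, abs_tol=1e-12)]

    # Prefer the smallest tied class that is actually present in the node.
    present = [i for i in tied if n_f_node[i] > 0]
    return present[0] if present else tied[0]
```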


Assignment of a Case 


For a case with predictor vector x, the assignment or prediction $d_T(x)$ for this case by the tree T is

$$d_T(x) = \begin{cases} j^*(t(x)) & \text{Y is categorical} \\ \bar{y}(t(x)) & \text{Y is continuous} \end{cases}$$

where t(x) is the terminal node the case falls in.


Risk Estimation 


Note that case weight is not involved in risk estimation, though it is involved in tree growing 
process and class assignment. 


Loss Function 


A loss function L(y, a) is a real-valued function in which y is the actual value of Y and a is the
assignment taken. Throughout this document, the following types of loss functions are used.

$$L(y, a) = \begin{cases} C(a|y) & \text{Y is categorical} \\ (y - a)^2 & \text{Y is continuous} \end{cases}$$


Risk Estimation of a Tree 


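Suppose that a tree T is grown and assignments have been given to each node. Let $\tilde{T}$ denote the
set of terminal nodes of the tree. Let D be the data set used to calculate the risk. Dropping all
cases in D to T, let D(t) denote the set of cases that fall in node t. The risk of the tree based
on data D is estimated by

$$R(T|D) = \begin{cases} \sum_j \pi(j) \bar{L}_j & \text{Y categorical} \\ \bar{L} & \text{Y continuous} \end{cases} = \begin{cases} \bar{L} & \text{Y categorical, M1} \\ \sum_j \pi(j) \bar{L}_j & \text{Y categorical, M2} \\ \bar{L} & \text{Y continuous} \end{cases}$$

where M1 represents the empirical prior situation, and M2 the non-empirical prior, and

$$\bar{L} = \frac{1}{N_f} \sum_{n \in D} f_n L(y_n, d_T(x_n)), \qquad \bar{L}_j = \frac{1}{N_{f,j}} \sum_{n \in D} f_n L(y_n, d_T(x_n)) I(y_n = j)$$

$$N_f = \sum_{n \in D} f_n, \qquad N_{f,j} = \sum_{n \in D} f_n I(y_n = j)$$

Assuming that $L(y_n, d_T(x_n))$ are independent of each other, then the variance of R(T) is
estimated by

$$Var(R(T)) = \begin{cases} \sum_j \pi(j)^2 \frac{s_j^2}{N_{f,j}} & \text{Y categorical, M2} \\ \frac{s^2}{N_f} & \text{Y con, or, Y cat and M1} \end{cases}$$

where

$$s_j^2 = \frac{1}{N_{f,j}} \sum_{n \in D} f_n \left( L(y_n, d_T(x_n)) - \bar{L}_j \right)^2 I(y_n = j) = \frac{1}{N_{f,j}} \sum_{n \in D} f_n L^2(y_n, d_T(x_n)) I(y_n = j) - \bar{L}_j^2$$

$$s^2 = \frac{1}{N_f} \sum_{n \in D} f_n \left( L(y_n, d_T(x_n)) - \bar{L} \right)^2 = \frac{1}{N_f} \sum_{n \in D} f_n L^2(y_n, d_T(x_n)) - \bar{L}^2$$

Putting everything together:

$$R(T|D) = \begin{cases} \frac{1}{N_f} \sum_{t \in \tilde{T}} \sum_j C(j^*(t)|j) N_{f,j}(t) & \text{Y categorical, M1} \\ \sum_j \frac{\pi(j)}{N_{f,j}} \sum_{t \in \tilde{T}} C(j^*(t)|j) N_{f,j}(t) & \text{Y categorical, M2} \\ \frac{1}{N_f} \sum_{t \in \tilde{T}} \sum_{n \in D(t)} f_n (y_n - \bar{y}(t))^2 & \text{Y continuous} \end{cases}$$

$$Var(R(T|D)) = \begin{cases} \frac{1}{N_f^2} \left\{ \sum_{t \in \tilde{T}} \sum_j N_{f,j}(t) C(j^*(t)|j)^2 - N_f R(T|D)^2 \right\} & \text{Y cat, M1} \\ \sum_j \left( \frac{\pi(j)}{N_{f,j}} \right)^2 \left\{ \sum_{t \in \tilde{T}} N_{f,j}(t) C(j^*(t)|j)^2 - \frac{1}{N_{f,j}} \left( \sum_{t \in \tilde{T}} N_{f,j}(t) C(j^*(t)|j) \right)^2 \right\} & \text{Y cat, M2} \\ \frac{1}{N_f^2} \left\{ \sum_{t \in \tilde{T}} \sum_{n \in D(t)} f_n (y_n - \bar{y}(t))^4 - N_f R(T|D)^2 \right\} & \text{Y con} \end{cases}$$

where

$$N_{f,j}(t) = \sum_{n \in D(t)} f_n I(y_n = j)$$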
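
For the categorical, empirical-prior case (M1), the risk and its variance reduce to simple sums over the terminal-node class counts. The sketch below is illustrative only; the names are hypothetical, classes are indexed from 0, and `cost[i][j]` is assumed to hold $C(i|j)$.

```python
import math


def tree_risk_m1(node_counts, node_assignment, cost):
    """Risk, variance and standard error of a classification tree under M1.

    node_counts[t][j]  : N_{f,j}(t) for terminal node t
    node_assignment[t] : j*(t), the class assigned to terminal node t
    """
    n_f = sum(sum(counts) for counts in node_counts)
    loss_sum = 0.0     # sum over nodes and classes of N_{f,j}(t) * C(j*(t)|j)
    loss_sq_sum = 0.0  # the same sum with the cost squared
    for counts, assigned in zip(node_counts, node_assignment):
        for j, n in enumerate(counts):
            loss_sum += n * cost[assigned][j]
            loss_sq_sum += n * cost[assigned][j] ** 2

    risk = loss_sum / n_f
    variance = (loss_sq_sum - n_f * risk ** 2) / n_f ** 2
    return risk, variance, math.sqrt(variance)
```

With unit costs the risk is the misclassification rate R, and the variance reduces to R(1 - R)/N_f, the familiar binomial form.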


The estimated standard error of R(T|D) is given by $se(R(T|D)) = \sqrt{Var(R(T|D))}$.

Risk estimation of a tree is often written as $R(T|D) = \sum_{t \in \tilde{T}} R(t|D)$, with R(t|D) being the
contribution from node t to the tree risk such that

$$R(t|D) = \begin{cases} \frac{1}{N_f} \sum_j N_{f,j}(t) C(j^*(t)|j) & \text{Y categorical, M1} \\ \sum_j \frac{\pi(j) N_{f,j}(t)}{N_{f,j}} C(j^*(t)|j) & \text{Y categorical, M2} \\ \frac{1}{N_f} \sum_{n \in D(t)} f_n (y_n - \bar{y}(t))^2 & \text{Y continuous} \end{cases}$$


Resubstitution Estimate of the Risk


The resubstitution risk estimation method uses the same set of data (learning sample) that is used
to grow the tree T to calculate its risk, that is:

$$R(t) = R(t|\hbar)$$

$$R(T) = R(T|\hbar) = \sum_{t \in \tilde{T}} R(t)$$

$$Var(R(T)) = Var(R(T|\hbar))$$


Test Sample Estimate of the Risk 

The idea of test sample risk estimation is that the whole data set is divided into 2 mutually
exclusive subsets $\hbar$ and $\hbar'$. $\hbar$ is used as a learning sample to grow a tree T and $\hbar'$ is used as a test
sample to check the accuracy of the tree. The test sample estimate is

$$R^{ts}(T) = R(T|\hbar')$$

$$Var(R^{ts}(T)) = Var(R(T|\hbar'))$$


Cross Validation Estimate of the Risk 


Cross validation estimation is provided only when a tree is grown using the automatic tree
growing process. Let T be a tree which has been grown using all data from the whole data set
$\hbar^0$. Let V ≥ 2 be a positive integer.


1. Divide $\hbar^0$ into V mutually exclusive subsets $\hbar'_v$, v = 1, ..., V. Let $\hbar_v$ be $\hbar^0 - \hbar'_v$, v = 1, ..., V.

2. For each v, consider $\hbar_v$ as a learning sample and grow a tree $T_v$ on $\hbar_v$ by using the same set of
user-specified stopping rules which was applied to grow T.

3. After $T_v$ is grown and assignment $j_v^*(t)$ or $\bar{y}_v(t)$ for node t of $T_v$ is done, consider $\hbar'_v$ as a test
sample and calculate its test sample risk estimate $R^{ts}(T_v)$.

4. Repeat above for each v = 1, ..., V. The weighted average of these test sample risk estimates is
used as the V-fold cross validation risk estimate of T.


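The V-fold cross validation estimate, $R^{cv}(T)$, of the risk of a tree T and its variance are estimated
by

$$R^{cv}(T) = \begin{cases} \sum_j \pi(j) \sum_v \frac{N_{v,f,j}}{N_{f,j}} R^{ts}(T_v|j) & \text{Y categorical, M2} \\ \sum_v \frac{N_{v,f}}{N_f} R^{ts}(T_v) & \text{Y con, or, Y cat and M1} \end{cases}$$

$$Var(R^{cv}(T)) = \begin{cases} \frac{1}{N_f^2} \left\{ \sum_v \sum_{t \in \tilde{T}_v} \sum_j N_{v,f,j}(t) C(j_v^*(t)|j)^2 - N_f R^{cv}(T)^2 \right\} & \text{Y cat, M1} \\ \sum_j \left( \frac{\pi(j)}{N_{f,j}} \right)^2 \left\{ \sum_v \sum_{t \in \tilde{T}_v} N_{v,f,j}(t) C(j_v^*(t)|j)^2 - \frac{1}{N_{f,j}} \left( \sum_v \sum_{t \in \tilde{T}_v} N_{v,f,j}(t) C(j_v^*(t)|j) \right)^2 \right\} & \text{Y cat, M2} \\ \frac{1}{N_f^2} \left\{ \sum_v \sum_{t \in \tilde{T}_v} \sum_{n \in \hbar'_v(t)} f_n (y_n - \bar{y}_v(t))^4 - N_f R^{cv}(T)^2 \right\} & \text{Y con} \end{cases}$$

where

$$R^{ts}(T_v|j) = \frac{1}{N_{v,f,j}} \sum_{t \in \tilde{T}_v} N_{v,f,j}(t) C(j_v^*(t)|j), \qquad N_{v,f,j}(t) = \sum_{n \in \hbar'_v(t)} f_n I(y_n = j)$$

$$N_{v,f} = \sum_{n \in \hbar'_v} f_n, \qquad N_{v,f,j} = \sum_{n \in \hbar'_v} f_n I(y_n = j)$$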


References 


Breiman, L., J. H. Friedman, R. A. Olshen, and C. J. Stone. 1984. Classification and Regression 
Trees. New York: Chapman & Hall/CRC. 


Gain Summary Algorithms 

The Gain Summary summarizes a tree by displaying descriptive statistics for each terminal node.
This allows users to recognize the relative contribution of each terminal node and identify the
subsets of terminal nodes that are most useful. This document can be used for all tree growing
algorithms: CART, CHAID, exhaustive CHAID and QUEST.


Note that case weight is not involved in gain summary calculations though it is involved in tree 
growing process and class assignment. 


Types of Gain Summaries 


Depending on the type of dependent variable, different statistics are given in the gain summary.


Average Oriented Gain Summary (Y continuous). Statistics related to the node mean of Y are
given. Through this summary, users may identify the terminal nodes that give the largest (or
smallest) average of the dependent variable.


Target Class Gain Summary (Y categorical). Statistics related to a dependent variable class of
interest (target class) are given. Users may identify the terminal nodes that have a large
relative contribution to the target class.


Average Profit Value Gain Summary (Y categorical). Statistics related to average profits are
given. Users may be interested in identifying the terminal nodes that have relatively large
average profit values.


Node-by-Node, Cumulative, Percentile Gain Summary. To assist users in identifying the
interesting terminal nodes and in understanding the result of a tree, three different ways
(node-by-node, cumulative and percentile) of looking at the gain summaries mentioned above are provided.


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


Y The dependent, or target, variable. It can be either categorical (nominal 
or ordinal) or continuous. If Y is categorical with J classes, its class takes 
values in C = {1,..., J}. 


D                  Data set used to calculate gain statistics. It can be either learning sample
                   data set or test sample data set.
D(t)               Cases in D in node t.
$y_n$              The dependent variable value for case n.
$f_n$              The frequency weight associated with case n. Non-integral positive value is
                   rounded to its nearest integer.
$N_f$              The number of cases in D, $N_f = \sum_{n \in D} f_n$.
$N_f(t)$           The number of cases in D(t), $N_f(t) = \sum_{n \in D(t)} f_n$.
$N_{f,j}$          The number of class j cases in D, $N_{f,j} = \sum_{n \in D} f_n I(y_n = j)$.
$N_{f,j}(t)$       The number of class j cases in D(t), $N_{f,j}(t) = \sum_{n \in D(t)} f_n I(y_n = j)$.
$\bar{y}(t)$       The mean of dependent variable in D(t), $\bar{y}(t) = \frac{1}{N_f(t)} \sum_{n \in D(t)} f_n y_n$.
$j^*$              Target class of interest; it is any value in {1, ..., J}. If not user-specified,
                   the default target class is 1.
r(j), e(j)         Respectively, the revenue and expense associated with class j.
pv(j)              The profit value associated with class j, pv(j) = r(j) - e(j).
$j^*(t)$           Class assignment given by terminal node t.
$\pi(j)$           Prior probability of class j, j = 1, ..., J.
M1                 For categorical Y, denotes the empirical prior situation. CHAID and
                   exhaustive CHAID are always considered as having an empirical prior.
M2                 For categorical Y, denotes the non-empirical prior situation.


Node by Node Summary


The node-by-node gain summary includes statistics for each node that are defined as follows. 


Terminal Node 


The identity of a terminal node. It is denoted by t.


Size: n


Total number of cases in the terminal node. It is denoted by $N_f(t)$.


Size: %


Percentage of cases in the node. It is denoted by $p_f(t) \cdot 100\%$, where $p_f(t)$ is given by

$$p_f(t) = \begin{cases} \frac{N_f(t)}{N_f} & \text{M1, or, Y continuous} \\ \sum_j \frac{\pi(j) N_{f,j}(t)}{N_{f,j}} & \text{M2} \end{cases}$$


Gain: n


Total number of target class $j^*$ cases in the node, $N_{f,j^*}(t)$.


This is only computed for the target class gain summary type.


Gain: %


Percentage of target class $j^*$ cases in the sample that belong to the node. It is denoted by
$p_f(t|j^*) \cdot 100\%$, where

$$p_f(t|j^*) = \frac{N_{f,j^*}(t)}{N_{f,j^*}}$$

This is only computed for the target class gain summary type.


Score 


Depending on the type of gain summary, the score is defined and named differently. But they
are all denoted by s(t).


Response: % (for target class gain summary only)


The ratio of the number of target class $j^*$ cases in the node to the total number of cases in the node.

$$s(t) = \begin{cases} \frac{N_{f,j^*}(t)}{N_f(t)} & \text{M1} \\ \frac{\pi(j^*) N_{f,j^*}(t)}{p_f(t) N_{f,j^*}} & \text{M2} \end{cases}$$


Average Profit (for average profit value gain summary only)


The average profit value for the node.

$$s(t) = \begin{cases} \frac{\sum_j N_{f,j}(t) \, pv(j)}{N_f(t)} & \text{M1} \\ \frac{1}{p_f(t)} \sum_j \frac{\pi(j) N_{f,j}(t) \, pv(j)}{N_{f,j}} & \text{M2} \end{cases}$$


Mean (for average oriented gain summary only)


The respective mean of the continuous dependent variable Y at the node.

$$s(t) = \bar{y}(t)$$


ROI (Return on Investment) 

ROI for a node is calculated as average profit divided by average expense.

$$ROI(t) = \frac{s(t)}{s_e(t)}$$

where $s_e(t)$ is the average expense for node t and is calculated using the equation for s(t) with
pv(j) replaced by e(j).


This is only computed for the average profit value gain summary type.


Index (%) 


For the target class gain summary, it is the ratio of the score for the node to the proportion of class
$j^*$ cases in the sample. It is denoted by $is(t) \cdot 100\%$, where is(t) is

$$is(t) = \begin{cases} \frac{s(t)}{N_{f,j^*} / N_f} & \text{M1} \\ \frac{s(t)}{\pi(j^*)} & \text{M2} \end{cases}$$

For the average profit value gain summary, it is the ratio of the score for the node to the average
profit value for the sample.

$$is(t) = \begin{cases} \frac{s(t)}{\sum_j N_{f,j} \, pv(j) / N_f} & \text{M1} \\ \frac{s(t)}{\sum_j \pi(j) \, pv(j)} & \text{M2} \end{cases}$$

For the average oriented gain summary, it is the ratio of the gain score for the node to the gain
score s(t = 1) for root node t = 1.


Note: if the denominator is 0, the index is not available. 
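
To make the node-by-node statistics concrete, the sketch below builds the target class gain rows for the empirical-prior case (M1). It is an illustration under stated assumptions: the names are hypothetical, classes are indexed from 0, every node is non-empty, and the index is reported as `None` when its denominator is 0.

```python
def target_class_gain_rows(node_counts, target):
    """Node-by-node target class gain summary under M1, sorted by descending score.

    node_counts : {node_id: [N_{f,j}(t) for each class j]}
    target      : index of the target class j*
    """
    n_f = sum(sum(counts) for counts in node_counts.values())            # N_f
    n_target = sum(counts[target] for counts in node_counts.values())    # N_{f,j*}
    base_rate = n_target / n_f

    rows = []
    for node_id, counts in node_counts.items():
        size = sum(counts)        # Size: n
        gain = counts[target]     # Gain: n
        score = gain / size       # Response as a proportion
        rows.append({
            "node": node_id,
            "size_n": size,
            "size_pct": 100.0 * size / n_f,
            "gain_n": gain,
            "gain_pct": 100.0 * gain / n_target if n_target else None,
            "response_pct": 100.0 * score,
            "index_pct": 100.0 * score / base_rate if base_rate else None,
        })
    return sorted(rows, key=lambda row: row["response_pct"], reverse=True)
```

An index above 100% marks a node in which the target class is over-represented relative to the whole sample.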


Cumulative Summary 


In the cumulative gain summary, all nodes are first sorted with respect to the values of the score
s(t). To simplify the formulas, we assume that nodes in the collection $\{t_1, t_2, ..., t_{|\tilde{T}|}\}$ are
already sorted either in descending or ascending order.


Terminal Node 


The identity of a terminal node. It is denoted by $t_s$.


Cumulative Size: n

$${}^{c}N_f(t_s) = \sum_{i=1}^{s} N_f(t_i)$$


Cumulative Size: %

$${}^{c}p_f(t_s) = \sum_{i=1}^{s} p_f(t_i)$$


Cumulative Gain: n

$${}^{c}N_{f,j^*}(t_s) = \sum_{i=1}^{s} N_{f,j^*}(t_i)$$


Cumulative Gain: %

$${}^{c}p_f(t_s|j^*) = \sum_{i=1}^{s} p_f(t_i|j^*)$$


Score 


For cumulative response, it is the ratio of the number of target class $j^*$ cases up to the node to
the total number of cases up to the node. For cumulative average profit, it is the average profit
value up to the node. For cumulative mean, it is the mean of all $y_n$’s up to the terminal nodes. In
all cases, the same formula is used, but the appropriate formulas for $s(t_i)$ and $p_f(t_i)$ should be
used in the calculations. This cumulative score is denoted by:

$${}^{c}s(t_s) = \begin{cases} \frac{\sum_{i=1}^{s} s(t_i) N_f(t_i)}{\sum_{i=1}^{s} N_f(t_i)} & \text{M1, or, Y continuous} \\ \frac{\sum_{i=1}^{s} s(t_i) p_f(t_i)}{\sum_{i=1}^{s} p_f(t_i)} & \text{M2} \end{cases}$$

Cumulative ROI 


$${}^{c}ROI(t_s) = \frac{{}^{c}s(t_s)}{{}^{c}s_e(t_s)}$$

where ${}^{c}s_e(t_s)$ is the cumulative expense, calculated by the equation for ${}^{c}s(t_s)$ with
pv(j) replaced by e(j).


This is only computed for the average profit value gain summary type.


Cumulative Index % 


For the target class cumulative gain summary, it is the ratio of the cumulative gain score for the
node to the proportion of class $j^*$ cases in the sample. It is denoted by ${}^{c}is(t_s) \cdot 100\%$, where:

$${}^{c}is(t_s) = \begin{cases} \frac{{}^{c}s(t_s)}{N_{f,j^*} / N_f} & \text{M1} \\ \frac{{}^{c}s(t_s)}{\pi(j^*)} & \text{M2} \end{cases}$$

For the average profit value cumulative gain summary, it is the ratio of the cumulative gain score
for the node to the average profit value for the sample.

$${}^{c}is(t_s) = \begin{cases} \frac{{}^{c}s(t_s)}{\sum_j N_{f,j} \, pv(j) / N_f} & \text{M1} \\ \frac{{}^{c}s(t_s)}{\sum_j \pi(j) \, pv(j)} & \text{M2} \end{cases}$$

For the average oriented cumulative gain summary, it is the ratio of the cumulative score for the
node to the score s(t = 1) for root node t = 1.

$${}^{c}is(t_s) = \frac{{}^{c}s(t_s)}{s(t = 1)}$$


Note: if the denominator is 0, the index is not available. 


Percentile Summary 


Like the cumulative gain summary, all nodes are first sorted with respect to the values of their scores.
To simplify the formulas, we assume that nodes in the collection $\{t_1, t_2, ..., t_{|\tilde{T}|}\}$ are already
sorted in either descending or ascending order. Let q be any positive integer that divides 100. The
value of q will be used as the percentage increment for percentiles, and is user-specified (default
q = 10). For fixed q, the number of percentiles to be studied is 100/q. The pth percentile to be
studied is the pq%-tile, and its size is $N_{f,pq} = N_f \cdot pq\%$, p = 1, ..., 100/q. For any pq%-tile, let
$s_p$ and $s'_p$ be the two smallest integers in $\{1, ..., |\tilde{T}|\}$ such that

$$N_{f,pq} \in \left( {}^{c}N_f(t_{s_p - 1}), \; {}^{c}N_f(t_{s_p}) \right], \qquad N_{f,(p-1)q} \in \left[ {}^{c}N_f(t_{s'_p - 1}), \; {}^{c}N_f(t_{s'_p}) \right)$$

where ${}^{c}N_f(t_s) = \sum_{i=1}^{s} N_f(t_i)$ is the cumulative size up to node $t_s$.


Terminal Node 


The identity of all terminal nodes that belong to the pth increment. Node t belongs to the pth
increment if $t \in [t_{s'_p}, t_{s_p}]$.


Percentile (%)


Percentile being studied. The pth percentile is the pq%-tile.


Percentile: n


Total number of cases in the percentile, $N_{f,pq} = [N_f \cdot pq\%]$, where [x] denotes the nearest
integer of x.


Gain: n


Total number of class $j^*$ cases in the pq-percentile.

$${}^{p}N_{f,j^*}(p) = {}^{c}N_{f,j^*}(t_{s_p - 1}) + \frac{N_{f,pq} - {}^{c}N_f(t_{s_p - 1})}{N_f(t_{s_p})} N_{f,j^*}(t_{s_p})$$

where ${}^{c}N_{f,j^*}(t_0)$ is defined to be 0.


This is only computed for the target class percentile gain summary type.


Gain: %


Percentage of class $j^*$ cases in the sample that belong to the pq%-tile. It is denoted by
${}^{p}p_{f,j^*}(p) \cdot 100\%$, where

$${}^{p}p_{f,j^*}(p) = \frac{{}^{p}N_{f,j^*}(p)}{N_{f,j^*}}$$

This is only computed for the target class percentile gain summary type.


Percentile Score 


For the target class percentile gain summary, it is an estimate of the ratio of the number of class
$j^*$ cases in the pq-percentile to the total number of cases in the percentile. For the average profit
value percentile gain summary, it is an estimate of the average profit value in the pq-percentile. For
the average oriented percentile gain summary, it is an estimate of the average of the gain score for
all nodes in the percentile. In all charts, the same formula is used.

$${}^{p}s(p) = \begin{cases} \frac{{}^{c}N_f(t_{s_p - 1}) \cdot {}^{c}s(t_{s_p - 1}) + \left( N_{f,pq} - {}^{c}N_f(t_{s_p - 1}) \right) s(t_{s_p})}{N_{f,pq}} & \text{M1, or, Y continuous} \\ \frac{{}^{c}p_f(t_{s_p - 1}) \cdot {}^{c}s(t_{s_p - 1}) + \left( p_{f,pq} - {}^{c}p_f(t_{s_p - 1}) \right) s(t_{s_p})}{p_{f,pq}} & \text{M2} \end{cases}$$

where

$$p_{f,pq} = {}^{c}p_f(t_{s_p - 1}) + \frac{N_{f,pq} - {}^{c}N_f(t_{s_p - 1})}{N_f(t_{s_p})} p_f(t_{s_p})$$


Percentile ROI

$${}^{p}ROI(p) = \frac{{}^{p}s(p)}{{}^{p}s_e(p)}$$

where ${}^{p}s_e(p)$ is the percentile expense, calculated through the equation for ${}^{p}s(p)$ with pv(j) replaced
by e(j).


This is only computed for the average profit value gain summary type.


Percentile Index 


For the target class percentile gain summary, it is the ratio of the percentile gain score for the
pq-percentile to the proportion of class $j^*$ cases in the sample. It is denoted by ${}^{p}is(p) \cdot 100\%$,
where

$${}^{p}is(p) = \begin{cases} \frac{{}^{p}s(p)}{N_{f,j^*} / N_f} & \text{M1} \\ \frac{{}^{p}s(p)}{\pi(j^*)} & \text{M2} \end{cases}$$

For the average profit value percentile gain summary, it is the ratio of the percentile gain score for
the pq-percentile to the average of the profit values for the sample.

$${}^{p}is(p) = \begin{cases} \frac{{}^{p}s(p)}{\sum_j N_{f,j} \, pv(j) / N_f} & \text{M1} \\ \frac{{}^{p}s(p)}{\sum_j \pi(j) \, pv(j)} & \text{M2} \end{cases}$$

For the average oriented percentile gain summary, it is the ratio of the percentile gain score in
the pq-percentile to the gain score s(t = 1) for root node t = 1.

$${}^{p}is(p) = \frac{{}^{p}s(p)}{s(t = 1)}$$

Note: if the denominator, which is the average or the median of $y_n$’s in the sample, is 0, the
index is not available.


Cost—Complexity Pruning Algorithms 


Assuming a CART or QUEST tree has been grown successfully using a learning sample, this 
document describes the automatic cost-complexity pruning process for both CART and QUEST 
trees. Materials in this document are based on Classification and Regression Trees by Breiman et 
al (1984). Calculations of the risk estimates used throughout this document are given in TREE 
Algorithms. 


Cost-Complexity Risk of a Tree 


Given a tree T and a real number $\alpha$, the cost-complexity risk of T with respect to $\alpha$ is

$$R_\alpha(T) = R(T) + \alpha |\tilde{T}|$$

where $|\tilde{T}|$ is the number of terminal nodes and R(T) is the resubstitution risk estimate of T.


Smallest Optimally Pruned Subtree 


Pruned subtree. For any tree T, T' is a pruned subtree of T if T' is a tree with the same root node
as T and all nodes of T' are also nodes of T. Denote $T' \preceq T$ if T' is a pruned subtree of T.


Optimally pruned subtree. Given $\alpha$, a pruned subtree T' of T is called an optimally pruned
subtree of T with respect to $\alpha$ if $R_\alpha(T') = \min_{T'' \preceq T} R_\alpha(T'')$. The optimally pruned subtree may not
be unique.


Smallest optimally pruned subtree. If $T' \preceq T''$ for any optimally pruned subtree $T'' \preceq T_0$ such that
$R_\alpha(T') = R_\alpha(T'')$, then T' is the smallest optimally pruned subtree of $T_0$ with respect to
$\alpha$, and is denoted by $T_0(\alpha)$.


Cost-Complexity Pruning Process 


Suppose that a tree $T_0$ was grown. The cost-complexity pruning process consists of two steps:


1. Based on the learning sample, find a sequence of pruned subtrees $\{T_k\}_{k=0}^{K}$ of $T_0$ such that
$T_0 \succ T_1 \succ T_2 \succ ... \succ T_K$, where $T_K$ has only the root node of $T_0$.


2. Find an “honest” risk estimate $\hat{R}(T_k)$ of each subtree. Select a right sized tree from the sequence
of pruned subtrees.


Generate a sequence of smallest optimally pruned subtrees 


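To generate a sequence of pruned subtrees, the cost-complexity pruning technique developed by
Breiman et al. (1984) is used. In generating the sequence of subtrees, only the learning sample
is used. Given any real value $\alpha_{\min}$ ($\alpha_{\min} = 0$) and an initial tree $T_0$, there exists a sequence
of real values $-\infty < \alpha_1 = \alpha_{\min} < \alpha_2 < \dots < \alpha_K < +\infty$ and a sequence of pruned subtrees
$T_0 \succeq T_1 \succeq \dots \succeq T_K$, such that the smallest optimally pruned subtree of $T_0$ for a given $\alpha$ is


$$T_0(\alpha) = \begin{cases} T_0 & \alpha < \alpha_1 \\ T_0(\alpha_k) = T_k & \alpha_k \le \alpha < \alpha_{k+1},\ 1 \le k < K \\ T_0(\alpha_K) = T_K & \alpha_K \le \alpha \end{cases}$$


where


$$\alpha_{k+1} = \min_{t \in T_k} g_k(t), \qquad T_{k+1} = \{t \in T_k : g_k(s) > \alpha_{k+1} \text{ for all ancestors } s \text{ of } t\}$$


$$g_k(t) = \begin{cases} \dfrac{R(t) - R(T_{k,t})}{|\tilde{T}_{k,t}| - 1} & t \in T_k - \tilde{T}_k \\ +\infty & t \in \tilde{T}_k \end{cases}$$


$T_{k,t}$ is the branch of $T_k$ stemming from node $t$, and $R(t)$ is the resubstitution risk estimate of
node $t$ based on the learning sample.


Explicit algorithm


For node $t$, let


$$lt(t) = \begin{cases} 0 & t \text{ is terminal} \\ \text{left child of } t & \text{otherwise} \end{cases}$$


$$rt(t) = \begin{cases} 0 & t \text{ is terminal} \\ \text{right child of } t & \text{otherwise} \end{cases}$$


$$pa(t) = \begin{cases} 0 & t \text{ is root node} \\ \text{parent of } t & \text{otherwise} \end{cases}$$


$$N(t) = \begin{cases} 1 & t \text{ is terminal} \\ |\tilde{T}_{k,t}| & \text{otherwise} \end{cases}$$


$$S(t) = \begin{cases} R(t) & t \text{ is terminal} \\ R(T_{k,t}) & \text{otherwise} \end{cases}$$


$$G(t) = \min_{s \in T_{k,t}} g_k(s)$$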
Then the explicit algorithm is as follows:

Initialization. Set k = 1, α = α_min.

For t = #T_0 to 1:

    if t is a terminal node, set

        N(t) = 1, S(t) = R(t), g(t) = G(t) = +∞

    else set

        N(t) = N(lt(t)) + N(rt(t))

        S(t) = S(lt(t)) + S(rt(t))

        g(t) = (R(t) − S(t)) / (N(t) − 1)

        G(t) = min{g(t), G(lt(t)), G(rt(t))}

Main algorithm. If G(1) > α, set

    α_k = α and T_k = {t ∈ T_{k−1} : g(s) > α_k for all ancestors s of t}

    α = G(1), k = k + 1

else if N(1) = 1, terminate the process.


Set t = 1.

While G(t) < g(t), set t = lt(t) if G(t) = G(lt(t)), and t = rt(t) otherwise.

Make the current node t terminal by setting N(t) = 1, S(t) = R(t), g(t) = G(t) = +∞.

Update the ancestors' information of the current node t; while t > 1:

    t = pa(t)

    N(t) = N(lt(t)) + N(rt(t))

    S(t) = S(lt(t)) + S(rt(t))

    g(t) = (R(t) − S(t)) / (N(t) − 1)

    G(t) = min{g(t), G(lt(t)), G(rt(t))}


Repeat the main algorithm until the process terminates.
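
The explicit algorithm can be sketched in Python as follows. This is only an illustrative sketch, not the procedure's actual implementation: the function name `pruning_sequence`, the list-based tree layout (nodes numbered 1..n, every child numbered after its parent, 0 meaning "no child"), and the choice to report each subtree by its terminal nodes are all assumptions made for the example.

```python
import math


def pruning_sequence(left, right, risk, alpha_min=0.0):
    """Return [(alpha_k, terminal nodes of T_k), ...] for a binary tree.

    Hypothetical layout: index 0 is unused, left[t] == right[t] == 0 marks a
    terminal node, and risk[t] is the resubstitution risk R(t).
    """
    n = len(risk) - 1
    parent = [0] * (n + 1)
    for t in range(1, n + 1):
        if left[t]:
            parent[left[t]] = parent[right[t]] = t

    N = [0] * (n + 1)
    S = [0.0] * (n + 1)
    g = [math.inf] * (n + 1)
    G = [math.inf] * (n + 1)

    def make_terminal(t):
        N[t], S[t], g[t], G[t] = 1, risk[t], math.inf, math.inf

    def refresh(t):
        lt, rt = left[t], right[t]
        N[t] = N[lt] + N[rt]
        S[t] = S[lt] + S[rt]
        g[t] = (risk[t] - S[t]) / (N[t] - 1)
        G[t] = min(g[t], G[lt], G[rt])

    def terminal_nodes():
        found, stack = [], [1]
        while stack:
            t = stack.pop()
            if N[t] == 1:          # terminal in the current (pruned) tree
                found.append(t)
            else:
                stack.extend((left[t], right[t]))
        return sorted(found)

    # Initialization: bottom-up pass, t = #T_0 down to 1.
    for t in range(n, 0, -1):
        if left[t] == 0:
            make_terminal(t)
        else:
            refresh(t)

    alpha, sequence = alpha_min, []
    while True:
        if G[1] > alpha:
            sequence.append((alpha, terminal_nodes()))  # record (alpha_k, T_k)
            alpha = G[1]
        elif N[1] == 1:
            return sequence
        # Walk down to a weakest link, prune it, then update its ancestors.
        t = 1
        while G[t] < g[t]:
            t = left[t] if G[t] == G[left[t]] else right[t]
        make_terminal(t)
        while t > 1:
            t = parent[t]
            refresh(t)


# Five-node example: node 1 splits into 2 and 3, node 2 splits into 4 and 5.
left = [0, 2, 4, 0, 0, 0]
right = [0, 3, 5, 0, 0, 0]
risk = [0.0, 0.5, 0.25, 0.125, 0.0625, 0.125]
sequence = pruning_sequence(left, right, risk)
```

In this toy tree the weakest link is node 2, so the sequence goes from the full tree to the tree with terminal nodes 2 and 3, and finally to the root alone.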


Selecting the Right Sized Subtree 


To select the right sized pruned subtree from the sequence of pruned subtrees $\{T_k\}_{k=0}^{K}$ of $T_0$, an
“honest” method is used to estimate the risk $\hat{R}(T_k)$ and its standard error $se(\hat{R}(T_k))$ of each
subtree $T_k$. Two methods can be used: the resubstitution estimation method and the test sample
estimation method. Resubstitution estimation is used if there is no test sample. Test sample
estimation is used if there is a testing sample. Select the subtree $T^*$ as the right sized subtree of
$T_0$ based on one of the following rules.


Simple rule


The right sized tree is selected as the $k^* \in \{0, 1, 2, \dots, K\}$ such that


$$\hat{R}(T_{k^*}) = \min_k \hat{R}(T_k)$$


b-SE rule


For any nonnegative real value $b$ (default $b = 1$), the right sized tree is selected as the largest $k^{**} \in \{0, 1, 2, \dots, K\}$ such that


$$\hat{R}(T_{k^{**}}) \le \hat{R}(T_{k^*}) + b \cdot se(\hat{R}(T_{k^*}))$$
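
As a small illustration of the two selection rules, the sketch below takes a list of risk estimates and their standard errors, indexed by k, and returns both the simple-rule index and the b-SE index. The function name `select_subtree` and the list inputs are assumptions for the example. Ties in the minimum risk are resolved here by taking the first minimizer, which the text does not specify.

```python
def select_subtree(risks, std_errors, b=1.0):
    """Return (k_star, k_se) for risk estimates indexed by subtree k = 0..K."""
    # Simple rule: the subtree with the smallest estimated risk.
    k_star = min(range(len(risks)), key=lambda k: risks[k])
    # b-SE rule: the largest k whose risk is within b standard errors of the minimum.
    threshold = risks[k_star] + b * std_errors[k_star]
    k_se = max(k for k, r in enumerate(risks) if r <= threshold)
    return k_star, k_se


choice = select_subtree([0.30, 0.25, 0.26, 0.40], [0.02, 0.02, 0.02, 0.03])
```

With b = 0 the two rules coincide whenever the minimum is unique; larger values of b favor smaller trees.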


References 


Breiman, L., J. H. Friedman, R. A. Olshen, and C. J. Stone. 1984. Classification and Regression 
Trees. New York: Chapman & Hall/CRC. 


TSMODEL Algorithms 


The TSMODEL procedure builds univariate exponential smoothing, ARIMA (Autoregressive 
Integrated Moving Average), and transfer function (TF) models for time series, and produces 
forecasts. The procedure includes an Expert Modeler that identifies and estimates an appropriate 
model for each dependent variable series. Alternatively, you can specify a custom model. 


This algorithm was designed with help from Professor Ruey Tsay at The University of Chicago.


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


$Y_t$ ($t = 1, 2, \dots, n$)    Univariate time series under investigation.
$n$    Total number of observations.
$\hat{Y}_t(k)$    Model-estimated k-step ahead forecast at time t for series Y.
$s$    The seasonal length.


Models


TSMODEL estimates exponential smoothing models and ARIMA/TF models. 


Exponential Smoothing Models 


The following notation is specific to exponential smoothing models: 


$\alpha$    Level smoothing weight
$\gamma$    Trend smoothing weight
$\phi$    Damped trend smoothing weight
$\delta$    Season smoothing weight


Simple Exponential Smoothing 


Simple exponential smoothing has a single level parameter and can be described by the following 
equations: 


$$L(t) = \begin{cases} \alpha Y(t) + (1 - \alpha) L(t-1), & \text{if } Y(t) \text{ is not missing} \\ L(t-1), & \text{else} \end{cases}$$


It is functionally equivalent to an ARIMA(0,1,1) process.


Brown’s Exponential Smoothing


Brown’s exponential smoothing has level and trend parameters and can be described by the
following equations:


$$L(t) = \begin{cases} \alpha Y(t) + (1 - \alpha) L(t-1), & \text{if } Y(t) \text{ is not missing} \\ L(t-1) + T(t-1), & \text{else} \end{cases}$$


$$T(t) = \begin{cases} \alpha (L(t) - L(t-1)) + (1 - \alpha) T(t-1), & \text{if } Y(t) \text{ is not missing} \\ T(t-1), & \text{else} \end{cases}$$


$$\hat{Y}_t(k) = L(t) + ((k - 1) + \alpha^{-1})\, T(t)$$


It is functionally equivalent to an ARIMA(0,2,2) with restriction among MA parameters. 


Holt’s Exponential Smoothing 


Holt’s exponential smoothing has level and trend parameters and can be described by the 
following equations: 


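$$L(t) = \begin{cases} \alpha Y(t) + (1 - \alpha)(L(t-1) + T(t-1)), & \text{if } Y(t) \text{ is not missing} \\ L(t-1) + T(t-1), & \text{else} \end{cases}$$


$$T(t) = \begin{cases} \gamma (L(t) - L(t-1)) + (1 - \gamma) T(t-1), & \text{if } Y(t) \text{ is not missing} \\ T(t-1), & \text{else} \end{cases}$$


$$\hat{Y}_t(k) = L(t) + k\,T(t)$$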


It is functionally equivalent to an ARIMA(0,2,2). 
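
The Holt recursions, including the missing-value branch, can be sketched as below. This is a minimal illustration rather than the procedure's code: the function names, the use of `None` to mark a missing observation, and the explicitly supplied starting states `level0` and `trend0` are assumptions for the example.

```python
def holt_smooth(y, alpha, gamma, level0, trend0):
    """Run Holt's level/trend recursions over y; None marks a missing value."""
    level, trend = level0, trend0
    for obs in y:
        prev_level, prev_trend = level, trend
        if obs is None:
            # Missing value: extrapolate the level and carry the trend forward.
            level = prev_level + prev_trend
            trend = prev_trend
        else:
            level = alpha * obs + (1 - alpha) * (prev_level + prev_trend)
            trend = gamma * (level - prev_level) + (1 - gamma) * prev_trend
    return level, trend


def holt_forecast(level, trend, k):
    """k-step ahead forecast L(t) + k T(t)."""
    return level + k * trend


final_level, final_trend = holt_smooth([2.0, None, 4.0], 0.5, 0.5, 1.0, 1.0)
two_step = holt_forecast(final_level, final_trend, 2)
```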


Damped-Trend Exponential Smoothing 


Damped-Trend exponential smoothing has level and damped trend parameters and can be 
described by the following equations: 


$$L(t) = \begin{cases} \alpha Y(t) + (1 - \alpha)(L(t-1) + \phi T(t-1)), & \text{if } Y(t) \text{ is not missing} \\ L(t-1) + \phi T(t-1), & \text{else} \end{cases}$$


$$T(t) = \begin{cases} \gamma (L(t) - L(t-1)) + (1 - \gamma) \phi T(t-1), & \text{if } Y(t) \text{ is not missing} \\ \phi T(t-1), & \text{else} \end{cases}$$


It is functionally equivalent to an ARIMA(1,1,2).


Simple Seasonal Exponential Smoothing


Simple seasonal exponential smoothing has level and season parameters and can be described
by the following equations:


$$L(t) = \begin{cases} \alpha (Y(t) - S(t-s)) + (1 - \alpha) L(t-1), & \text{if } Y(t) \text{ is not missing} \\ L(t-1), & \text{else} \end{cases}$$


$$S(t) = \begin{cases} \delta (Y(t) - L(t)) + (1 - \delta) S(t-s), & \text{if } Y(t) \text{ is not missing} \\ S(t-s), & \text{else} \end{cases}$$


$$\hat{Y}_t(k) = L(t) + S(t + k - s)$$


It is functionally equivalent to an ARIMA(0,1,(1,s,s+1))(0,1,0) with restrictions among MA 
parameters. 
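
A short sketch of the simple seasonal recursions follows. It is illustrative only: the function names, the `None` convention for missing values, and the way the seasonal states are stored (a list holding S(1 - s), ..., S(0) that is overwritten in place, one slot per season) are assumptions. The forecast helper assumes 1 <= k <= s so that S(t + k - s) has already been computed.

```python
def simple_seasonal_smooth(y, alpha, delta, level0, season0):
    """Return the final level and seasonal states after smoothing y.

    season0 holds the initial states S(1 - s), ..., S(0); slot (t - 1) % s
    always contains the most recent seasonal state for the season of time t.
    """
    s = len(season0)
    level, season = level0, list(season0)
    for t, obs in enumerate(y, start=1):
        idx = (t - 1) % s                      # slot holding S(t - s)
        if obs is not None:
            new_level = alpha * (obs - season[idx]) + (1 - alpha) * level
            season[idx] = delta * (obs - new_level) + (1 - delta) * season[idx]
            level = new_level
        # Missing value: L(t) = L(t - 1) and S(t) = S(t - s), so nothing changes.
    return level, season


def simple_seasonal_forecast(level, season, n, k):
    """Forecast L(n) + S(n + k - s), for 1 <= k <= s."""
    return level + season[(n + k - 1) % len(season)]


level, season = simple_seasonal_smooth([12.0, 8.0], 0.5, 0.5, 10.0, [1.0, -1.0])
one_step = simple_seasonal_forecast(level, season, n=2, k=1)
```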


Winters’ Additive Exponential Smoothing 


Winters’ additive exponential smoothing has level, trend, and season parameters and can be 
described by the following equations: 


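$$L(t) = \begin{cases} \alpha (Y(t) - S(t-s)) + (1 - \alpha)(L(t-1) + T(t-1)), & \text{if } Y(t) \text{ is not missing} \\ L(t-1) + T(t-1), & \text{else} \end{cases}$$


$$T(t) = \begin{cases} \gamma (L(t) - L(t-1)) + (1 - \gamma) T(t-1), & \text{if } Y(t) \text{ is not missing} \\ T(t-1), & \text{else} \end{cases}$$


$$S(t) = \begin{cases} \delta (Y(t) - L(t)) + (1 - \delta) S(t-s), & \text{if } Y(t) \text{ is not missing} \\ S(t-s), & \text{else} \end{cases}$$


It is functionally equivalent to an ARIMA(0,1,s+1)(0,1,0) with restrictions among MA parameters.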


Winters’ Multiplicative Exponential Smoothing 


Winters’ multiplicative exponential smoothing has level, trend and season parameters and can be 
described by the following equations: 


$$L(t) = \begin{cases} \alpha (Y(t) / S(t-s)) + (1 - \alpha)(L(t-1) + T(t-1)), & \text{if } Y(t) \text{ is not missing} \\ L(t-1) + T(t-1), & \text{else} \end{cases}$$


$$T(t) = \begin{cases} \gamma (L(t) - L(t-1)) + (1 - \gamma) T(t-1), & \text{if } Y(t) \text{ is not missing} \\ T(t-1), & \text{else} \end{cases}$$


$$S(t) = \begin{cases} \delta (Y(t) / L(t)) + (1 - \delta) S(t-s), & \text{if } Y(t) \text{ is not missing} \\ S(t-s), & \text{else} \end{cases}$$


$$\hat{Y}_t(k) = (L(t) + k\,T(t))\, S(t + k - s)$$


There is no equivalent ARIMA model.


Estimation and Forecasting of Exponential Smoothing


The sum of squares of the one-step ahead prediction error, $\sum_t (Y_t - \hat{Y}_{t-1}(1))^2$, is minimized
to optimize the smoothing weights.


Initialization of Exponential Smoothing 


Let $L$ denote the level, $T$ the trend, and $S$, a vector of length $s$, the seasonal states. The
initial smoothing states are obtained by back-casting from t = n to t = 0. Initialization for back-casting is
described here.


For all the models, $L = y_n$.


For all non-seasonal models with trend, T is the negative of the slope of the line (with intercept) 
fitted to the data with time as a regressor. 


For the simple seasonal model, the elements of S are seasonal averages minus the sample mean; 
for example, for monthly data the element corresponding to January will be average of all January 
values in the sample minus the sample mean. 


For the additive Winters’ model, fit $y = \alpha t + \sum_{i=1}^{s} \beta_i I_i(t)$ to the data, where $t$ is time and
$I_i(t)$ are seasonal dummies. Note that the model does not have an intercept. Then $T = -\alpha$, and
$S = \beta - \mathrm{mean}(\beta)$.


For the multiplicative Winters’ model, fit a separate line (with intercept) for each season with time
as a regressor. Suppose $\mu$ is the vector of intercepts and $\beta$ is the vector of slopes (these vectors
will be of length $s$). Then $T = -\mathrm{mean}(\beta)$ and $S = (\mu + \beta) / (\mathrm{mean}(\mu) + \mathrm{mean}(\beta))$.


The initial smoothing states are: 


$L' = L(0)$
$T' = -T(0)$
$S' = (S(1-s), S(2-s), \dots, S(-1), S(0)) = (S(1), S(2), \dots, S(-1+s), S(0))$


ARIMA and Transfer Function Models 


The following notation is specific to ARIMA/TF models: 


$a_t$ ($t = 1, 2, \dots, n$)    White noise series normally distributed with mean zero and variance $\sigma_a^2$
$p$    Order of the non-seasonal autoregressive part of the model
$q$    Order of the non-seasonal moving average part of the model
$d$    Order of the non-seasonal differencing
$P$    Order of the seasonal autoregressive part of the model
$Q$    Order of the seasonal moving-average part of the model
$D$    Order of the seasonal differencing
$s$    Seasonality or period of the model
$\varphi_p(B)$    AR polynomial of B of order p, $\varphi_p(B) = 1 - \varphi_1 B - \varphi_2 B^2 - \dots - \varphi_p B^p$
$\theta_q(B)$    MA polynomial of B of order q, $\theta_q(B) = 1 - \theta_1 B - \theta_2 B^2 - \dots - \theta_q B^q$
$\Phi_P(B^s)$    Seasonal AR polynomial of $B^s$ of order P, $\Phi_P(B^s) = 1 - \Phi_1 B^s - \Phi_2 B^{2s} - \dots - \Phi_P B^{Ps}$
$\Theta_Q(B^s)$    Seasonal MA polynomial of $B^s$ of order Q, $\Theta_Q(B^s) = 1 - \Theta_1 B^s - \Theta_2 B^{2s} - \dots - \Theta_Q B^{Qs}$
$\Delta$    Differencing operator $\Delta = (1 - B)^d (1 - B^s)^D$
$B$    Backward shift operator with $B Y_t = Y_{t-1}$ and $B a_t = a_{t-1}$
$Z\sigma_t^2$    Prediction variance of $Z_t$
$N\sigma_t^2$    Prediction variance of the noise forecasts


Transfer function (TF) models form a very large class of models, which include univariate ARIMA
models as a special case. Suppose $Y_t$ is the dependent series and, optionally, $X_{1t}, X_{2t}, \dots, X_{kt}$ are
to be used as predictor series in this model. A TF model describing the relationship between the
dependent and predictor series has the following form:


$$Z_t = f(Y_t),$$


$$\Delta Z_t = \mu + \sum_{i=1}^{k} \frac{Num_i}{Den_i} \Delta_i B^{b_i} f_i(X_{it}) + \frac{MA}{AR}\, a_t$$


The univariate ARIMA model simply drops the predictors from the TF model; thus, it has the
following form:


$$\Delta Z_t = \mu + \frac{MA}{AR}\, a_t$$


The main features of this model are: 


■ An initial transformation of the dependent and predictor series, $f$ and $f_i$. This transformation
is optional and is applicable only when the dependent series values are positive. Allowed
transformations are log and square root. These transformations are sometimes called
variance-stabilizing transformations.

■ A constant term $\mu$.

■ The unobserved i.i.d., zero mean, Gaussian error process $a_t$ with variance $\sigma^2$.

■ The moving average lag polynomial $MA = \theta_q(B)\Theta_Q(B^s)$ and the auto-regressive lag
polynomial $AR = \varphi_p(B)\Phi_P(B^s)$.

■ The difference/lag operators $\Delta$ and $\Delta_i$.

■ A delay term, $B^{b_i}$, where $b_i$ is the order of the delay.

■ Predictors are assumed given. Their numerator and denominator lag polynomials are
of the form: $Num_i = (\omega_{i0} - \omega_{i1} B - \dots - \omega_{iu} B^{u})(1 - \Omega_{i1} B^{s} - \dots - \Omega_{iv} B^{vs}) B^{b_i}$ and
$Den_i = (1 - \delta_{i1} B - \dots - \delta_{ir} B^{r})(1 - \Delta_{i1} B^{s} - \dots)$.

■ The “noise” series

$$N_t = \Delta Z_t - \mu - \sum_{i=1}^{k} \frac{Num_i}{Den_i} \Delta_i B^{b_i} f_i(X_{it})$$

is assumed to be a mean zero, stationary ARMA process.


Interventions and Events 


Interventions and events are handled like any other predictor; typically they are coded as 0/1 
variables, but note that a given intervention variable’s exact effect upon the model is determined 
by the transfer function in front of it. 


Estimation and Forecasting of ARIMA/TF 


There are two forecasting algorithms available: Conditional Least Squares (CLS) and Exact Least 
Squares (ELS) or Unconditional Least Squares forecasting (ULS). These two algorithms differ in 
only one aspect: they forecast the noise process differently. The general steps in the forecasting 
computations are as follows: 


1. Computation of noise process $N_t$ through the historical period.


2. Forecasting the noise process $N_t$ up to the forecast horizon. This is one step ahead forecasting
during the historical period and multi-step ahead forecasting after that. The differences in CLS
and ELS forecasting methodologies surface in this step. The prediction variances of noise
forecasts are also computed in this step.


3. Final forecasts are obtained by first adding back to the noise forecasts the contributions of the
constant term and the transfer function inputs and then integrating and back-transforming the
result. The prediction variances of noise forecasts also may have to be processed to obtain the
final prediction variances.


Let $\hat{N}_t(k)$ and $\sigma_t^2(k)$ be the k-step forecast and forecast variance, respectively.


Conditional Least Squares (CLS) Method 


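$$\hat{N}_t(k) = E(N_{t+k} \mid N_t, N_{t-1}, \dots)$$ assuming $N_t = 0$ for $t < 0$.


$$\sigma_t^2(k) = \sigma^2 \sum_{j=0}^{k-1} \psi_j^2$$


where $\psi_j$ are coefficients of the power series expansion of $MA / (\Delta \times AR)$.


Minimize $S = \sum_t (N_t - \hat{N}_{t-1}(1))^2$.


Missing values are imputed with forecast values of $N_t$.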


Maximum Likelihood (ML) Method (Brockwell and Davis, 1991) 


$$\hat{N}_t(k) = E(N_{t+k} \mid N_t, N_{t-1}, \dots, N_1)$$


Maximize the likelihood of $\{N_t - \hat{N}_{t-1}(1)\}_{t=1}^{n}$; that is,


$$L = -\ln(S/n) - (1/n) \sum_{j=1}^{n} \ln(\eta_j)$$


where $S = \sum_t (N_t - \hat{N}_{t-1}(1))^2 / \eta_t$, and $\sigma_t^2 = \sigma^2 \eta_t$ is the one-step ahead forecast variance.


When missing values are present, a Kalman filter is used to calculate $\hat{N}_t(k)$.


Error Variance 


$$\hat{\sigma}^2 = S / (n - k)$$

in both methods. Here n is the number of non-zero residuals and k is the number of parameters
(excluding error variance).


Initialization of ARIMA/TF 


A slightly modified Levenberg-Marquardt algorithm is used to optimize the objective function. 
The modification takes into account the “admissibility” constraints on the parameters. The 
admissibility constraint requires that the roots of AR and MA polynomials be outside the unit circle 
and the sum of denominator polynomial parameters be non-zero for each predictor variable. The 
minimization algorithm requires a starting value to begin its iterative search. All the numerator and
denominator polynomial parameters are initialized to zero except the coefficient of the 0th power
in the numerator polynomial, which is initialized to the corresponding regression coefficient.


The ARMA parameters are initialized as follows:

Assume that the series $Y_t$ follows an ARMA(p,q)(P,Q) model with mean 0; that is:

$$Y_t - \varphi_1 Y_{t-1} - \dots - \varphi_p Y_{t-p} = a_t - \theta_1 a_{t-1} - \dots - \theta_q a_{t-q}$$


In the following, $c_l$ and $\rho_l$ represent the lth lag autocovariance and autocorrelation of
$Y_t$ respectively, and $\hat{c}_l$ and $\hat{\rho}_l$ represent their estimates.


Non-Seasonal AR Parameters 


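For AR parameter initial values, the estimation method is the same as that in Appendix A6.2 of
Box, Jenkins, and Reinsel (1994). Denote the estimates as $\hat{\varphi}_1, \dots, \hat{\varphi}_p$.


Non-Seasonal MA Parameters

Let

$$w_t = Y_t - \varphi_1 Y_{t-1} - \dots - \varphi_p Y_{t-p} = a_t - \theta_1 a_{t-1} - \dots - \theta_q a_{t-q}$$


The cross covariance

$$\lambda_l = E(w_{t+l}\, a_t) = E((a_{t+l} - \theta_1 a_{t+l-1} - \dots - \theta_q a_{t+l-q})\, a_t) = \begin{cases} \sigma_a^2 & l = 0 \\ -\theta_1 \sigma_a^2 & l = 1 \\ \quad\vdots & \\ -\theta_q \sigma_a^2 & l = q \\ 0 & l > q \end{cases}$$

Assuming that an AR(p+q) can approximate $Y_t$, it follows that:

$$Y_t - \varphi'_1 Y_{t-1} - \dots - \varphi'_p Y_{t-p} - \varphi'_{p+1} Y_{t-p-1} - \dots - \varphi'_{p+q} Y_{t-p-q} = a_t$$

The AR parameters of this model are estimated as above and are denoted as $\hat{\varphi}'_1, \dots, \hat{\varphi}'_{p+q}$.

Thus $\lambda_l$ can be estimated by

$$\hat{\lambda}_l = E\big((Y_{t+l} - \hat{\varphi}_1 Y_{t+l-1} - \dots - \hat{\varphi}_p Y_{t+l-p})(Y_t - \hat{\varphi}'_1 Y_{t-1} - \dots - \hat{\varphi}'_{p+q} Y_{t-p-q})\big)$$

$$= c_0 \Big( \rho_l - \sum_{j=1}^{p+q} \hat{\varphi}'_j \rho_{l+j} - \sum_{i=1}^{p} \hat{\varphi}_i \rho_{l-i} + \sum_{i=1}^{p} \sum_{j=1}^{p+q} \hat{\varphi}_i \hat{\varphi}'_j \rho_{l+j-i} \Big)$$

And the error variance $\sigma_a^2$ is approximated by

$$\hat{\sigma}_a^2 = Var\Big( -\sum_{j=0}^{p+q} \hat{\varphi}'_j Y_{t-j} \Big) = \sum_{i=0}^{p+q} \sum_{j=0}^{p+q} \hat{\varphi}'_i \hat{\varphi}'_j c_{i-j} = c_0 \sum_{i=0}^{p+q} \sum_{j=0}^{p+q} \hat{\varphi}'_i \hat{\varphi}'_j \rho_{i-j}$$

with $\hat{\varphi}'_0 = -1$.

Then the initial MA parameters are approximated by $\theta_l = -\lambda_l / \sigma_a^2$ and estimated by

$$\hat{\theta}_l = -\hat{\lambda}_l / \hat{\sigma}_a^2 = -\frac{\rho_l - \sum_{j=1}^{p+q} \hat{\varphi}'_j \rho_{l+j} - \sum_{i=1}^{p} \hat{\varphi}_i \rho_{l-i} + \sum_{i=1}^{p} \sum_{j=1}^{p+q} \hat{\varphi}_i \hat{\varphi}'_j \rho_{l+j-i}}{\sum_{i=0}^{p+q} \sum_{j=0}^{p+q} \hat{\varphi}'_i \hat{\varphi}'_j \rho_{i-j}}$$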


So $\hat{\theta}_l$ can be calculated from $\{\hat{\varphi}_i\}$, $\{\hat{\varphi}'_i\}$, and $\{\hat{\rho}_l\}_{l=1}^{p+2q}$. In this procedure, only $\{\hat{\rho}_l\}_{l=1}^{p+q}$ are used and all
other parameters are set to 0.


Seasonal parameters 


For seasonal AR and MA components, the autocorrelations at the seasonal lags in the above 
equations are used. 


Calculation of the Transfer Function 


The transfer function needs to be calculated for each predictor series. For the predictor series $X_{it}$,
let the transfer function be:

$$V_{it} = \frac{Num_i}{Den_i} \Delta_i B^{b_i} f_i(X_{it})$$


It can be calculated as follows:

1. Calculate $U_{it} = \Delta_i B^{b_i} f_i(X_{it})$


2. Recursively calculate


$$V_{it} = -\sum_{j=1}^{D_i} Cden_i(j) \cdot V_{i,t-j} + \sum_{j=0}^{N_i} Cnum_i(j) \cdot U_{i,t-j}$$


where $Cden_i(j)$ and $Cnum_i(j)$ are the coefficients of $B^j$ in the polynomials $Den_i$ and
$Num_i$ respectively. Likewise, the summation limits $D_i$ and $N_i$ are the maximum degrees of $B$ in
the polynomials $Den_i$ and $Num_i$ respectively.


All missing $V_{i,t-j}$ in the first term of $V_{it}$ are taken to be $V_{i,-\infty}$ and missing $U_{i,t-j}$ in the second term
are taken to be $U_{i,-\infty}$, where $U_{i,-\infty}$ is the first non-missing measurement of $U_{it}$. $V_{i,-\infty}$ is given by


$$V_{i,-\infty} = \frac{Num_i(1)}{Den_i(1)}\, U_{i,-\infty}$$

where $Num_i(1)$ and $Den_i(1)$ are the $Num_i$ and $Den_i$ polynomials evaluated at B = 1.
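
The recursion in step 2 can be sketched as a rational filter. The sketch below is an illustration under stated assumptions, not the procedure's code: the name `transfer_filter` is hypothetical, `cnum[j]` and `cden[j]` are taken to be the coefficients of B^j in the numerator and denominator polynomials (with `cden[0] == 1`), the input series `u` is assumed to contain no missing values, and the pre-sample values are filled with the first observation and its steady-state response as described above.

```python
def transfer_filter(u, cnum, cden):
    """Apply V_t = -sum_{j>=1} cden[j] V_{t-j} + sum_{j>=0} cnum[j] U_{t-j}."""
    u_start = u[0]                                   # first non-missing U
    v_start = sum(cnum) / sum(cden) * u_start        # Num(1) / Den(1) * U_start
    v = []
    for t in range(len(u)):
        acc = 0.0
        for j, c in enumerate(cnum):
            acc += c * (u[t - j] if t - j >= 0 else u_start)
        for j in range(1, len(cden)):
            acc -= cden[j] * (v[t - j] if t - j >= 0 else v_start)
        v.append(acc)
    return v


# Numerator 0.5 and denominator 1 - 0.5 B give a steady-state gain of 1,
# so a constant input is reproduced exactly.
constant_response = transfer_filter([2.0, 2.0, 2.0, 2.0], [0.5], [1.0, -0.5])
```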


Diagnostic Statistics 


ARIMA/TF diagnostic statistics are based on residuals of the noise process, $R(t) = N(t) - \hat{N}(t)$.


Ljung-Box Statistic


$$Q(K) = n(n + 2) \sum_{k=1}^{K} r_k^2 / (n - k)$$


where $r_k$ is the kth lag ACF of residual.


$Q(K)$ is approximately distributed as $\chi^2(K - m)$, where m is the number of parameters other than
the constant term and predictor related-parameters.
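
The statistic is straightforward to compute from a residual series. The following sketch is illustrative only: the function name `ljung_box` is hypothetical, and the sample autocorrelations are computed around the residual mean, which is a common convention but an assumption here since the text does not spell out how $r_k$ is estimated.

```python
def ljung_box(residuals, K):
    """Return Q(K) = n (n + 2) * sum_{k=1..K} r_k^2 / (n - k)."""
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [r - mean for r in residuals]
    c0 = sum(d * d for d in dev)
    total = 0.0
    for k in range(1, K + 1):
        # Lag-k sample autocorrelation of the residuals.
        r_k = sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        total += r_k ** 2 / (n - k)
    return n * (n + 2) * total


q1 = ljung_box([1.0, -1.0, 1.0, -1.0, 1.0, -1.0], K=1)
```

The returned value would then be compared with a chi-square distribution with K - m degrees of freedom.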


Outlier Detection in Time Series Analysis 


The observed series may be contaminated by so-called outliers. These outliers may change the 
mean level of the uncontaminated series. The purpose of outlier detection is to find if there are 
outliers and what are their locations, types, and magnitudes. 


TSMODEL considers seven types of outliers. They are additive outliers (AO), innovational
outliers (IO), level shift (LS), temporary (or transient) change (TC), seasonal additive (SA), local
trend (LT), and AO patch (AOP).


Notation

The following notation is specific to outlier detection:


$U(t)$ or $U_t$    The uncontaminated series, outlier free. It is assumed to be a univariate ARIMA or
transfer function model.


Definitions of Outliers


Types of outliers are defined separately here. In practice any combination of these types can 
occur in the series under study. 


AO (Additive Outliers)

Assuming that an AO outlier occurs at time t = T, the observed series can be represented as

$$Y(t) = U(t) + \omega I_T(t)$$

where $I_T(t) = \begin{cases} 0 & t \ne T \\ 1 & t = T \end{cases}$ is a pulse function and $\omega$ is the deviation from the true $U(T)$ caused
by the outlier.


IO (Innovational Outliers)


Assuming that an IO outlier occurs at time t = T, then

$$Y(t) = U(t) + \frac{\theta(B)}{\Delta \varphi(B)}\, \omega I_T(t)$$


LS (Level Shift)

Assuming that a LS outlier occurs at time t = T, then

$$Y(t) = U(t) + \omega S_T(t)$$

where $S_T(t) = \frac{1}{1 - B} I_T(t) = \begin{cases} 0 & t < T \\ 1 & t \ge T \end{cases}$ is a step function.


TC (Temporary/Transient Change)

Assuming that a TC outlier occurs at time t = T, then

$$Y(t) = U(t) + \omega D_T(t)$$

where $D_T(t) = \frac{1}{1 - \delta B} I_T(t)$, $0 < \delta < 1$, is a damping function.


SA (Seasonal Additive)


Assuming that a SA outlier occurs at time t = T, then

$$Y(t) = U(t) + \omega SS_T(t)$$

where $SS_T(t) = \frac{1}{1 - B^s} I_T(t) = \begin{cases} 1 & t = T + ks,\ k \ge 0 \\ 0 & \text{otherwise} \end{cases}$ is a step seasonal pulse function.


LT (Local Trend)


Assuming that a LT outlier occurs at time t = T, then


$$Y(t) = U(t) + \omega T_T(t)$$


where $T_T(t) = \frac{1}{(1 - B)^2} I_T(t) = \begin{cases} t + 1 - T & t \ge T \\ 0 & \text{otherwise} \end{cases}$ is a local trend function.


AOP (AO patch) 


An AO patch is a group of two or more consecutive AO outliers. An AO patch can be described 
by its starting time and length. Assuming that there is a patch of AO outliers of length k at time 
t=T, the observed series can be represented as 


$$Y(t) = U(t) + \sum_{i=1}^{k} \omega_i I_{T-1+i}(t)$$


Due to a masking effect, a patch of AO outliers is very difficult to detect when searching for 
outliers one by one. This is why the AO patch is considered as a separate type from individual 
AO. For type AO patch, the procedure searches for the whole patch together. 
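
To make the deterministic outlier types concrete, the sketch below generates the unit-magnitude shape that multiplies $\omega$ for each type over times t = 1, ..., n. It is an illustration only: the function name `outlier_shape` and its default values for `delta` and `s` are assumptions chosen for the example, and the IO type is omitted because its shape depends on the fitted ARIMA polynomials.

```python
def outlier_shape(kind, n, T, delta=0.7, s=12):
    """Return the unit-magnitude effect of one outlier at time T, for t = 1..n.

    kind is one of "AO", "LS", "TC", "SA", "LT"; delta and s are illustrative
    defaults for the TC damping factor and the seasonal length.
    """
    if kind not in ("AO", "LS", "TC", "SA", "LT"):
        raise ValueError("unsupported outlier type: " + kind)
    shape = []
    for t in range(1, n + 1):
        lag = t - T
        if lag < 0:
            value = 0.0                                  # no effect before time T
        elif kind == "AO":
            value = 1.0 if lag == 0 else 0.0             # pulse
        elif kind == "LS":
            value = 1.0                                  # step
        elif kind == "TC":
            value = delta ** lag                         # geometrically damped pulse
        elif kind == "SA":
            value = 1.0 if lag % s == 0 else 0.0         # pulse repeated every s periods
        else:
            value = float(lag + 1)                       # local trend: 1, 2, 3, ...
        shape.append(value)
    return shape


level_shift = outlier_shape("LS", n=5, T=3)
local_trend = outlier_shape("LT", n=5, T=3)
```

An AO patch of length k starting at T is then the sum of k separately weighted AO pulses at times T, T + 1, ..., T + k - 1.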


Summary 


For an outlier of type O at time t = T (except AO patch):


$$Y(t) = \mu(t) + \omega L_O(B) I_T(t) + \frac{\theta(B)}{\Delta \varphi(B)}\, a(t)$$


where

$$L_O(B) = \begin{cases} 1 & O = AO \\ 1 / (\Delta \pi(B)) & O = IO \\ 1 / (1 - B) & O = LS \\ 1 / (1 - \delta B) & O = TC \\ 1 / (1 - B^s) & O = SA \\ 1 / (1 - B)^2 & O = LT \end{cases}$$


with $\pi(B) = \varphi(B) / \theta(B)$. A general model for incorporating outliers can thus be written as
follows:

$$Y(t) = \mu(t) + \sum_{k=1}^{M} \omega_k L_{O_k}(B) I_{T_k}(t) + \frac{\theta(B)}{\Delta \varphi(B)}\, a(t)$$


where M is the number of outliers.


Estimating the Effects of an Outlier


Suppose that the model and the model parameters are known. Also suppose that the type and 
location of an outlier are known. Estimation of the magnitude of the outlier and test statistics 
are as follows. 

The results in this section are only used in the intermediate steps of outlier detection procedure. 
The final estimates of outliers are from the model incorporating all the outliers in which all 
parameters are jointly estimated. 


Non-AO Patch Deterministic Outliers 


For a deterministic outlier of any type at time T (except AO patch), let $e(t)$ be the residual and
$x(t) = \pi(B) L(B) \Delta I_T(t)$, so:


$$e(t) = \omega x(t) + a(t)$$


From residuals $e(t)$, the parameters for outliers at time T are estimated by simple linear regression
of $e(t)$ on $x(t)$.


For j = 1 (AO), 2 (IO), 3 (LS), 4 (TC), 5 (SA), 6 (LT), define test statistics:


$$\lambda_j(T) = \frac{\omega_j(T)}{\sqrt{Var(\omega_j(T))}}$$


Under the null hypothesis of no outlier, $\lambda_j(T)$ is distributed as N(0,1) assuming the model and
model parameters are known.
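
The regression of $e(t)$ on $x(t)$ and the resulting test statistic can be sketched as follows. This is a minimal illustration under assumptions that the text does not state explicitly: the regression is taken through the origin, the innovation variance `sigma2` is treated as known, and the name `outlier_test` is hypothetical.

```python
import math


def outlier_test(e, x, sigma2):
    """Return (omega_hat, lambda) from regressing residuals e on regressor x.

    omega_hat = sum(e * x) / sum(x^2), with variance sigma2 / sum(x^2).
    """
    sxx = sum(v * v for v in x)
    omega_hat = sum(a * b for a, b in zip(e, x)) / sxx
    lam = omega_hat / math.sqrt(sigma2 / sxx)
    return omega_hat, lam


# A pulse regressor at t = 3 recovers the residual at that time directly.
omega_pulse, lambda_pulse = outlier_test([0.0, 0.0, 3.0, 0.0], [0.0, 0.0, 1.0, 0.0], 1.0)
# A step regressor pools the evidence of all later residuals.
omega_step, lambda_step = outlier_test([0.1, 2.0, 2.0, 2.0], [0.0, 1.0, 1.0, 1.0], 1.0)
```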


AO Patch Outliers 


For an AO patch of length k starting at time T, let $x_i(t; T) = \pi(B) \Delta I_{T-1+i}(t)$ for i = 1 to k, then

$$e(t) = \sum_{i=1}^{k} \omega_i(T)\, x_i(t; T) + a(t)$$


Multiple linear regression is used to fit this model. Test statistics are defined as:

$$\chi^2(T) = \hat{\omega}'(T)\, (X_T' X_T)\, \hat{\omega}(T) / \sigma_a^2$$


Assuming the model and model parameters are known, $\chi^2(T)$ has a Chi-square distribution with k
degrees of freedom under the null hypothesis $\omega_1(T) = \dots = \omega_k(T) = 0$.


Detection of Outliers 


The following flow chart demonstrates how automatic outlier detection works. Let M be the total
number of outliers and Nadj be the number of times the series is adjusted for outliers. At the
beginning of the procedure, M = 0 and Nadj = 0.


Figure 100-1
[Flow chart. Box labels in reading order; the connecting arrows are not recoverable from this text.]

Input: series to forecast, predictors, seasonal length, model (if it is known)
Is the model known? (Yes / No)
Is Nadj > 1? (Yes / No)
Check residual for an outlier. Is an outlier found? (Yes / No)
Adjust residual for the outlier found. K = K + 1
Is M > 0? (Yes / No)
M = M + K. Incorporating all M outliers, fit the model.
Delete insignificant outliers one at a time until all are significant. Update M.
Stop. No outliers.
Incorporating all M outliers, fit and delete insignificant parameters one at a time until all are significant. Update M.
Final model
Adjust original data for all M outliers. Nadj = Nadj + 1


Goodness-of-Fit Statistics 


Goodness-of-fit statistics are based on the original series Y(t). Let k = number of parameters in the
model, n = number of non-missing residuals.


Mean Squared Error


$$MSE = \frac{\sum (Y(t) - \hat{Y}(t))^2}{n - k}$$


Mean Absolute Percent Error


$$MAPE = \frac{100}{n} \sum \left| (Y(t) - \hat{Y}(t)) / Y(t) \right|$$


Maximum Absolute Percent Error


$$MaxAPE = 100 \max\left( \left| (Y(t) - \hat{Y}(t)) / Y(t) \right| \right)$$


Mean Absolute Error


$$MAE = \frac{1}{n} \sum \left| Y(t) - \hat{Y}(t) \right|$$


Maximum Absolute Error


$$MaxAE = \max\left( \left| Y(t) - \hat{Y}(t) \right| \right)$$


Normalized Bayesian Information Criterion


$$\text{Normalized BIC} = \ln(MSE) + k\, \frac{\ln(n)}{n}$$
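
The error-based statistics above can be computed together in a few lines. The sketch below is illustrative, not the procedure's code: the function name `fit_statistics` and the `None` convention for missing values are assumptions, the MSE is taken as the sum of squared errors divided by n - k, and the normalized BIC is taken as ln(MSE) + k ln(n) / n.

```python
import math


def fit_statistics(y, y_hat, k):
    """Return error-based fit statistics; k is the number of model parameters."""
    # Keep only the time points where both the observation and the fit exist.
    pairs = [(a, f) for a, f in zip(y, y_hat) if a is not None and f is not None]
    n = len(pairs)
    errors = [a - f for a, f in pairs]
    pct_errors = [abs(e / a) for e, (a, _) in zip(errors, pairs)]
    mse = sum(e * e for e in errors) / (n - k)
    return {
        "MSE": mse,
        "MAPE": 100.0 * sum(pct_errors) / n,
        "MaxAPE": 100.0 * max(pct_errors),
        "MAE": sum(abs(e) for e in errors) / n,
        "MaxAE": max(abs(e) for e in errors),
        "NormalizedBIC": math.log(mse) + k * math.log(n) / n,
    }


stats = fit_statistics([10.0, 20.0, 40.0, 50.0], [8.0, 20.0, 44.0, 45.0], k=1)
```

The percent-error statistics are undefined when an observed value is zero, so a caller would need to exclude or handle such points.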


R-Squared


$$R^2 = 1 - \frac{\sum_t (Y(t) - \hat{Y}(t))^2}{\sum_t (Y(t) - \bar{Y})^2}$$


Stationary R-Squared


A similar statistic was used by Harvey (Harvey, 1989).


$$R_S^2 = 1 - \frac{\sum_t (Z(t) - \hat{Z}(t))^2}{\sum_t (\Delta Z(t) - \overline{\Delta Z})^2}$$


where


The sum is over the terms in which both $Z(t) - \hat{Z}(t)$ and $\Delta Z(t) - \overline{\Delta Z}$ are not missing.


$\overline{\Delta Z}$ is the simple mean model for the differenced transformed series, which is equivalent to the
univariate baseline model ARIMA(0,d,0)(0,D,0).

For the exponential smoothing models currently under consideration, use the differencing orders
(corresponding to their equivalent ARIMA models if there is one).


$$d = \begin{cases} 2 & \text{Brown, Holt} \\ 1 & \text{other} \end{cases} \qquad D = \begin{cases} 0 & s = 1 \\ 1 & s > 1 \end{cases}$$


Note: Both the stationary and usual R-squared can be negative with range $(-\infty, 1]$. A negative
R-squared value means that the model under consideration is worse than the baseline model. Zero
R-squared means that the model under consideration is as good or bad as the baseline model.
Positive R-squared means that the model under consideration is better than the baseline model.


Expert Modeling 


Univariate Series 


Users can let the Expert Modeler select a model for them from:

■ All models (default).

■ Exponential smoothing models only.

■ ARIMA models only.


Exponential Smoothing Expert Model
Figure 100-2


Non-seasonal: fit all 4 non-seasonal ES models
Seasonal and positive: fit 6 ES models (no Brown)
Seasonal and not-all-positive: fit 5 ES models (no Brown, no multiplicative Winters)


ES EM = smallest BIC model


ARIMA Expert Model


Figure 100-3
[Flow chart. Box labels in reading order; the connecting arrows are not recoverable from this text.]

Transformation (none, log, or sqrt)?
Difference order
Pattern detection (ACF, PACF, EACF) for initial model
Fit the model by CLS. Delete insignificant parameters in 3 stages: 1. |t| < 0.5, 2. |t| < 1, 3. |t| < 2
Modify model (only once)
Fit the model by ML. Delete insignificant parameters
Diagnostic checking: Ljung-Box, ACF/PACF
ARIMA EM


Note: If 10<n<3s, set s=1 to build a non-seasonal model. 


All Models Expert Model 


In this case, the Exponential Smoothing and ARIMA expert models are computed, and the model 
with the smaller normalized BIC is chosen. 


Note: For short series, n<max(20,3s), use Exponential Smoothing Expert Model. 


Multivariate Series 


In the multivariate situation, users can let the Expert Modeler select a model for them from: 


■ All models (default). Note that if the multivariate expert ARIMA model drops all the
predictors and ends up with a univariate expert ARIMA model, this univariate expert ARIMA
model will be compared with expert exponential smoothing models as before and the Expert
Modeler will decide which is the best overall model.

■ ARIMA models only.


Transfer Function Expert Model
Figure 100-4
[Flow chart. Box labels in reading order; the connecting arrows are not recoverable from this text.]

Series to forecast: Y. Predictors: X1, X2, ...
Univariate ARIMA EM for Y: (p,d,q)(P,D,Q).
Transform and difference Y.
Drop X's with missing. Transform X's. Difference X's and Y's.
Delete some X by CCF, further difference some X.
Fit by CLS and check parameters for each X.
Delete insignificant ARMA parameters. Fit by CLS and check parameters.
For each X, find delay, rational TF. Delete insignificant ARMA parameters.
Fit the model by CLS. Delete insignificant non-[illegible] parameters.
Fit the model by ML. Delete insignificant parameters.
Diagnostic checking: Ljung-Box, ACF/PACF
Modify ARMA part as in univariate


Note: For short series, n < max(20, 3s), fit a univariate expert model.


TWOSTEP CLUSTER Algorithms


The TwoStep cluster method is a scalable cluster analysis algorithm designed to handle very large
data sets. It can handle both continuous and categorical variables or attributes. It requires only one
data pass. It has two steps: (1) pre-cluster the cases (or records) into many small sub-clusters; (2)
cluster the sub-clusters resulting from the pre-cluster step into the desired number of clusters. It can
also automatically select the number of clusters.


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


Table 101-1
Notation

Notation    Description
$K^A$    Total number of continuous variables used in the procedure.
$K^B$    Total number of categorical variables used in the procedure.
$L_k$    Number of categories for the kth categorical variable.
$R_k$    The range of the kth continuous variable.
$N$    Number of data records in total.
$N_k$    Number of data records in cluster k.
$\hat{\mu}_k$    The estimated mean of the kth continuous variable across the entire dataset.
$\hat{\sigma}_k^2$    The estimated variance of the kth continuous variable across the entire dataset.
$\hat{\mu}_{jk}$    The estimated mean of the kth continuous variable in cluster j.
$\hat{\sigma}_{jk}^2$    The estimated variance of the kth continuous variable in cluster j.
$N_{jkl}$    Number of data records in cluster j whose kth categorical variable takes the lth category.
$N_{kl}$    Number of data records in the kth categorical variable that take the lth category.
$d(j, s)$    Distance between clusters j and s.
$\langle j, s \rangle$    Index that represents the cluster formed by combining clusters j and s.


TwoStep Clustering Procedure 


The TwoStep clustering procedure consists of the following steps: 


1. Pre-clustering

2. Outlier handling (optional)

3. Clustering


Pre-cluster


The pre-cluster step uses a sequential clustering approach. It scans the data records one by one
and decides if the current record should be merged with the previously formed clusters or start a
new cluster based on the distance criterion (described below).

The procedure is implemented by constructing a modified cluster feature (CF) tree. The CF 
tree consists of levels of nodes, and each node contains a number of entries. A leaf entry (an entry 
in the leaf node) represents a final sub-cluster. The non-leaf nodes and their entries are used to 
guide a new record quickly into a correct leaf node. Each entry is characterized by its CF that 
consists of the entry’s number of records, mean and variance of each range field, and counts for 
each category of each symbolic field. For each successive record, starting from the root node, it is 
recursively guided by the closest entry in the node to find the closest child node, and descends 
along the CF tree. Upon reaching a leaf node, it finds the closest leaf entry in the leaf node. If 
the record is within a threshold distance of the closest leaf entry, it is absorbed into the leaf entry 
and the CF of that leaf entry is updated. Otherwise it starts its own leaf entry in the leaf node. If 
there is no space in the leaf node to create a new leaf entry, the leaf node is split into two. The 
entries in the original leaf node are divided into two groups using the farthest pair as seeds, and 
redistributing the remaining entries based on the closeness criterion. 

If the CF tree grows beyond the allowed maximum size, the CF tree is rebuilt based on the existing
CF tree by increasing the threshold distance criterion. The rebuilt CF tree is smaller and hence
has space for new input records. This process continues until a complete data pass is finished.
For details of CF tree construction, see the BIRCH algorithm (Zhang, Ramakrishnan, and Livny,
1996).

All records falling in the same entry can be collectively represented by the entry’s CF. When a 
new record is added to an entry, the new CF can be computed from this new record and the old CF 
without knowing the individual records in the entry. These properties of CF make it possible to 
maintain only the entry CFs, rather than the sets of individual records. Hence the CF-tree is much 
smaller than the original data and can be stored in memory more efficiently. 
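
The additive property of a CF can be illustrated with a small class that absorbs records one at a time without storing them. This is a sketch under assumptions, not the procedure's data structure: the class name `ClusterFeature`, the use of a running sum of squared deviations (Welford's update) to track the variance, and the choice of the population (divide by n) variance are all illustrative.

```python
from collections import Counter


class ClusterFeature:
    """Summary of an entry: record count, per-field mean/variance, category counts."""

    def __init__(self, n_continuous, n_categorical):
        self.n = 0
        self.mean = [0.0] * n_continuous
        self.m2 = [0.0] * n_continuous          # running sum of squared deviations
        self.counts = [Counter() for _ in range(n_categorical)]

    def add(self, continuous, categorical):
        """Absorb one record using only the record and the current summary."""
        self.n += 1
        for k, x in enumerate(continuous):
            delta = x - self.mean[k]
            self.mean[k] += delta / self.n
            self.m2[k] += delta * (x - self.mean[k])
        for k, category in enumerate(categorical):
            self.counts[k][category] += 1

    def variance(self, k):
        """Population variance of the kth continuous field (0.0 if empty)."""
        return self.m2[k] / self.n if self.n else 0.0


cf = ClusterFeature(n_continuous=1, n_categorical=1)
for value, label in [(1.0, "a"), (3.0, "b"), (5.0, "a")]:
    cf.add([value], [label])
```

Because each update touches only the summary, the memory needed per entry is independent of how many records the entry has absorbed.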

Note that the structure of the constructed CF tree may depend on the input order of the cases or 
records. To minimize the order effect, randomly order the records before building the model. 


Outlier Handling 


An optional outlier-handling step is implemented in the algorithm in the process of building the 
CF tree. Outliers are considered as data records that do not fit well into any cluster. We consider 
data records in a leaf entry as outliers if the number of records in the entry is less than a certain 
fraction (25% by default) of the size of the largest leaf entry in the CF tree. Before rebuilding the 
CF tree, the procedure checks for potential outliers and sets them aside. After rebuilding the CF 
tree, the procedure checks to see if these outliers can fit in without increasing the tree size. At the 
end of CF tree building, small entries that cannot fit in are outliers. 


Cluster 


The cluster step takes sub-clusters (non-outlier sub-clusters if outlier handling is used) resulting 
from the pre-cluster step as input and then groups them into the desired number of clusters. Since 
the number of sub-clusters is much less than the number of original records, traditional clustering 
methods can be used effectively. TwoStep uses an agglomerative hierarchical clustering method, 
because it works well with the auto-cluster method (see the section on auto-clustering below). 
Hierarchical clustering refers to a process by which clusters are recursively merged, until 
at the end of the process only one cluster remains containing all records. The process starts by 
defining a starting cluster for each of the sub-clusters produced in the pre-cluster step. (For more 
information, see the topic “Pre-cluster”.) All clusters are then compared, and the pair of clusters 
with the smallest distance between them is selected and merged into a single cluster. After 
merging, the new set of clusters is compared, the closest pair is merged, and the process repeats 
until all clusters have been merged. (If you are familiar with the way a decision tree is built, this 
is a similar process, except in reverse.) Because the clusters are merged recursively in this way, 
it is easy to compare solutions with different numbers of clusters. To get a five-cluster solution, 
simply stop merging when there are five clusters left; to get a four-cluster solution, take the five- 
cluster solution and perform one more merge operation, and so on. 


Accuracy 


In general, the larger the number of sub-clusters produced by the pre-cluster step, the more 
accurate the final result is. However, too many sub-clusters will slow down the clustering during 
the second step. The maximum number of sub-clusters should be carefully chosen so that it is large 
enough to produce accurate results and small enough not to slow down the second step clustering. 


Distance Measure 


A log-likelihood or Euclidean measure can be used to calculate the distance between clusters. 


Log-Likelihood Distance 


The log-likelihood distance measure can handle both continuous and categorical variables. It 
is a probability-based distance. The distance between two clusters is related to the decrease 
in log-likelihood as they are combined into one cluster. In calculating log-likelihood, normal 
distributions for continuous variables and multinomial distributions for categorical variables are 
assumed. It is also assumed that the variables are independent of each other, and so are the cases. 
The distance between clusters i and j is defined as: 


\[ d(i,j) = \xi_i + \xi_j - \xi_{\langle i,j \rangle} \]


where $\langle i,j \rangle$ denotes the cluster formed by combining clusters i and j, 


\[ \xi_v = -N_v \left( \sum_{k=1}^{K^A} \frac{1}{2} \log\left( \hat\sigma_k^2 + \hat\sigma_{vk}^2 \right) + \sum_{k=1}^{K^B} \hat E_{vk} \right) \]


and 


\[ \hat E_{vk} = -\sum_{l=1}^{L_k} \frac{N_{vkl}}{N_v} \log \frac{N_{vkl}}{N_v} \]


If $\hat\sigma_k^2$ is ignored in the expression for $\xi_v$, the distance between clusters i and j would be exactly the 
decrease in log-likelihood when the two clusters are combined. The $\hat\sigma_k^2$ term is added to solve the 
problem caused by $\hat\sigma_{vk}^2 = 0$, which would result in the natural logarithm being undefined. (This 
would occur, for example, when a cluster has only one case.) 
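
As a rough illustration of the distance d(i,j), the sketch below evaluates it for continuous variables only (no categorical terms). The tuple layout `(count, means, variances)`, the function names, and the use of variances with divisor N are assumptions made for the example.

```python
import math


def xi(n, variances, overall_variances):
    """-N_v * sum_k 0.5*log(sigma_k^2 + sigma_vk^2), continuous fields only."""
    return -n * sum(0.5 * math.log(s_all + s_v)
                    for s_all, s_v in zip(overall_variances, variances))


def merge(a, b):
    """Count, means and variances (divisor N) of the combined cluster."""
    (na, ma, va), (nb, mb, vb) = a, b
    n = na + nb
    means, variances = [], []
    for m1, m2, v1, v2 in zip(ma, mb, va, vb):
        mean = (na * m1 + nb * m2) / n
        var = (na * (v1 + (m1 - mean) ** 2) + nb * (v2 + (m2 - mean) ** 2)) / n
        means.append(mean)
        variances.append(var)
    return n, means, variances


def loglik_distance(a, b, overall_variances):
    """d(i,j) = xi_i + xi_j - xi_<i,j> for clusters given as (n, means, variances)."""
    m = merge(a, b)
    return (xi(a[0], a[2], overall_variances)
            + xi(b[0], b[2], overall_variances)
            - xi(m[0], m[2], overall_variances))


near = loglik_distance((5, [0.0], [1.0]), (5, [0.0], [1.0]), [1.0])
far = loglik_distance((5, [0.0], [1.0]), (5, [10.0], [1.0]), [1.0])
```

Two clusters with identical means and variances have distance zero, while moving one center away inflates the combined variance and therefore the distance.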


Euclidean Distance 


This distance measure can only be applied if all variables are continuous. The Euclidean distance 
between two points is clearly defined. The distance between two clusters is here defined by the 
Euclidean distance between the two cluster centers. A cluster center is defined as the vector 
of cluster means of each variable. 


Number of Clusters (auto-clustering) 


TwoStep can use the hierarchical clustering method in the second step to assess multiple cluster 
solutions and automatically determine the optimal number of clusters for the input data. A 
characteristic of hierarchical clustering is that it produces a sequence of partitions in one run: 1, 2, 
3, ... clusters. In contrast, a k-means algorithm would need to run multiple times (one for each 
specified number of clusters) in order to generate the sequence. To determine the number of 
clusters automatically, TwoStep uses a two-stage procedure that works well with the hierarchical 
clustering method. In the first stage, the BIC for each number of clusters within a specified range is 
calculated and used to find the initial estimate for the number of clusters. The BIC is computed as 


\[ \mathrm{BIC}(J) = -2 \sum_{j=1}^{J} \xi_j + m_J \log(N) \]


where 


\[ m_J = J \left\{ 2K^A + \sum_{k=1}^{K^B} (L_k - 1) \right\} \]


and other terms defined as in “Distance Measure”. The ratio of change in BIC at each 
successive merging relative to the first merging determines the initial estimate. Let dBIC(J) be 
the difference in BIC between the model with J clusters and that with (J + 1) clusters, 
$\mathrm{dBIC}(J) = \mathrm{BIC}(J) - \mathrm{BIC}(J+1)$. Then the change ratio for model J is 


\[ R_1(J) = \frac{\mathrm{dBIC}(J)}{\mathrm{dBIC}(1)} \]


If dBIC(1) < 0, then the number of clusters is set to 1 (and the second stage is omitted). 
Otherwise, the initial estimate for the number of clusters k is the smallest number for which 
$R_1(J) < 0.04$. 


In the second stage, the initial estimate is refined by finding the largest relative increase in distance 
between the two closest clusters in each hierarchical clustering stage. This is done as follows: 


> Starting with the model $C_k$ indicated by the BIC criterion, take the ratio of minimum inter-cluster 
distance for that model and the next larger model $C_{k+1}$, that is, the previous model in the 
hierarchical clustering procedure, 


\[ R_2(k) = \frac{d_{\min}(C_k)}{d_{\min}(C_{k+1})} \]


where $C_k$ is the cluster model containing k clusters and $d_{\min}(C)$ is the minimum inter-cluster 
distance for cluster model C. 


> Now from model $C_{k-1}$, compute the same ratio with the following model $C_k$, as above. Repeat for 
each subsequent model until you have the ratio $R_2(2)$. 


> Compare the two largest $R_2$ ratios; if the largest is more than 1.15 times the second largest, then 
select the model with the largest $R_2$ ratio as the optimal number of clusters; otherwise, from those 
two models with the largest $R_2$ values, select the one with the larger number of clusters as the 
optimal model. 
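
The two-stage rule can be sketched as follows. The function names, the dictionary inputs keyed by number of clusters, and the fallbacks used when no ratio falls below 0.04, when dBIC(1) is exactly zero, or when no ratio can be formed are assumptions of this example; the text above does not specify them.

```python
def initial_estimate(bic, threshold=0.04):
    """Stage 1: bic[J] is the BIC of the J-cluster model, J = 1..Jmax (Jmax >= 2)."""
    j_max = max(bic)
    d_bic = {j: bic[j] - bic[j + 1] for j in range(1, j_max)}
    if d_bic[1] <= 0:          # the text covers dBIC(1) < 0; zero is treated alike here
        return 1
    for j in range(1, j_max):
        if d_bic[j] / d_bic[1] < threshold:
            return j
    return j_max               # assumed fallback when no ratio is small enough


def refine(k, d_min, factor=1.15):
    """Stage 2: d_min[J] is the minimum inter-cluster distance of the J-cluster
    model; entries for J = 2..k+1 are required."""
    ratios = {j: d_min[j] / d_min[j + 1] for j in range(2, k + 1)}
    if not ratios:
        return k               # assumed: nothing to refine when k < 2
    ranked = sorted(ratios, key=ratios.get, reverse=True)
    if len(ranked) == 1:
        return ranked[0]
    best, second = ranked[0], ranked[1]
    if ratios[best] > factor * ratios[second]:
        return best
    return max(best, second)


bic = {1: 1000.0, 2: 600.0, 3: 400.0, 4: 390.0, 5: 385.0, 6: 384.0}
k0 = initial_estimate(bic)
k_final = refine(k0, {2: 10.0, 3: 6.0, 4: 1.0})
```

In this made-up example the BIC change ratio first drops below 0.04 at three clusters, and the distance ratio for three clusters clearly dominates, so the refinement keeps three.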


Cluster Membership Assignment 


Records are assigned to clusters based upon the specified outlier handling and distance measure 
options. 


Without Outlier-Handling 


Assign a record to the closest cluster according to the distance measure. 


With Outlier-Handling 


With outlier handling, records are assigned depending upon the distance measure specified. 


Log-Likelihood Distance 


Assume outliers or noise follow a uniform distribution. Calculate both the log-likelihood 
resulting from assigning a record to a noise cluster and that resulting from assigning it to the 
closest non-noise cluster. The record is then assigned to the cluster which leads to the larger 
log-likelihood. This is equivalent to assigning a record to its closest non-noise cluster if the 
distance between them is smaller than a critical value $C = \log(V)$, where $V = \prod_m R_m \prod_m L_m$. 
Otherwise, designate it as an outlier. 


Euclidean Distance 


Assign a record to its closest non-noise cluster if the Euclidean distance between them is smaller 
than a critical value 


\[ C = 2 \sqrt{ \frac{1}{J K^A} \sum_{j=1}^{J} \sum_{k=1}^{K^A} \hat\sigma_{jk}^2 } \]


Otherwise, designate it as an outlier. 


Missing Values 


No missing values are allowed. Cases with missing values are deleted on a listwise basis. 
VARCOMP Algorithms 


The Variance Components procedure provides estimates for variances of random effects under a 
general linear model framework. Four types of estimation methods are available in this procedure. 


Notation 


The following notation is used throughout this chapter. Unless otherwise stated, all vectors are 
column vectors and all quantities are known. 


$n$    Number of observations, $n \ge 1$ 

$k$    Number of random effects, $k \ge 0$ 

$m_0$    Number of parameters in the fixed effects, $m_0 \ge 0$ 

$m_i$    Number of parameters in the ith random effect, $m_i \ge 0$, i=1,...,k 

$m$    Total number of parameters, $m = m_0 + m_1 + \cdots + m_k$ 

$\sigma_i^2$    Unknown variance of the ith random effect, $\sigma_i^2 \ge 0$, i=1,...,k 

$\sigma_e^2$    Unknown variance of the residual term, same as $\sigma_{k+1}^2$, $\sigma_e^2 > 0$ 

$\gamma_i^2$    Unknown variance ratio of the ith random effect, $\gamma_i^2 = \sigma_i^2 / \sigma_e^2$, $\gamma_i^2 \ge 0$, 
    i=1,...,k, and $\gamma_{k+1}^2 = 1$ 

$y$    The length n vector of observations 

$e$    The length n vector of residuals 

$X_i$    The $n \times m_i$ design matrix, i=0,1,...,k 

$\beta_0$    The length $m_0$ vector of parameters of the fixed effects 

$\beta_i$    The length $m_i$ vector of parameters of the ith random effect, i=1,...,k 


Unless otherwise stated, a p×p identity matrix is denoted as $I_p$, a p×q zero matrix is denoted as 
$0_{p \times q}$, and a zero vector of length p is denoted as $0_p$. 


Weights 


For the sake of clarity and simplicity, the algorithms described in this chapter assume unit 
frequency weight and unit regression weight for all cases. Weights can be applied as described in 
the following two sections. 


Frequency Weight 


The WEIGHT command specifies frequency weights. 
■ Cases with nonpositive frequency are excluded from all calculations in the procedure. 
■ Non-integral frequency weight is rounded to the nearest integer. 
■ The total sample size is equal to the sum of positive rounded frequency weights. 


Regression Weight 


The REGWGT subcommand specifies regression weights. Suppose the ith case has a regression 
weight $w_i > 0$ (cases with nonpositive regression weights are excluded from all calculations 
in the procedure). Let $W = \mathrm{diag}(w_1, \ldots, w_n)$ be the n×n diagonal weight matrix. Then the 
VARCOMP procedure will perform all calculations as if y is physically transformed to $W^{1/2} y$ and 
$X_i$ to $W^{1/2} X_i$, i=0,1,...,k; and then the pertinent algorithm is applied to the transformed data. 


Model 


The mixed model is represented, following Rao (1973), as 


\[ y = X_0 \beta_0 + \sum_{i=1}^{k} X_i \beta_i + e \]


The random vectors $\beta_1, \ldots, \beta_k$ and e are assumed to be jointly independent. Moreover, the random 
vector $\beta_i$ is distributed as $N_{m_i}(0, \sigma_i^2 I_{m_i})$ for i=1,...,k and the residual vector e is distributed as 
$N_n(0, \sigma_e^2 I_n)$. It follows from these assumptions that y is distributed as $N_n(X_0 \beta_0, \sigma_e^2 V)$ where 


\[ V = \sum_{i=1}^{k} \gamma_i^2 X_i X_i' + I_n = \sum_{i=1}^{k+1} \gamma_i^2 X_i X_i' \]


with $X_{k+1} = I_n$. 

Minimum Norm Quadratic Unbiased Estimate (MINQUE) 


Given the initial guess or the prior values $\gamma_i^2 = a_i$ ($a_i \ge 0$), i=1,...,k+1, the MINQUE of $\sigma$ are 
obtained as a solution of the linear system of equations: 


\[ S \sigma = q \]


where $S = \{s_{ij}\}$ is a (k+1)×(k+1) symmetric matrix, $q = \{q_i\}$ is a (k+1) vector, and 
$\sigma = (\sigma_1^2, \ldots, \sigma_k^2, \sigma_{k+1}^2)'$. Define 


\[ R = V^{-1} - V^{-1} X_0 \left( X_0' V^{-1} X_0 \right)^{-} X_0' V^{-1} \]


The elements of S and q are 


\[ s_{ij} = \begin{cases}
\mathrm{SSQ}(X_i' R X_j) & i=1,\ldots,k,\ j=1,\ldots,k \\
\mathrm{SSQ}(X_i' R) & i=1,\ldots,k,\ j=k+1 \\
\mathrm{SSQ}(R X_j) & i=k+1,\ j=1,\ldots,k \\
\mathrm{SSQ}(R) & i=k+1,\ j=k+1
\end{cases} \]


and 


\[ q_i = \begin{cases}
\mathrm{SSQ}(X_i' R y) & i=1,\ldots,k \\
\mathrm{SSQ}(R y) & i=k+1
\end{cases} \]


where SSQ(A) is the sum of squares of all elements of a matrix A. 


MINQUE(0) 


The prior values are $a_i = 0$, i=1,...,k, and $a_{k+1} = 1$. Under this set of prior values, $V = I_n$ and 
$R = I_n - X_0 (X_0' X_0)^{-} X_0'$. Since this R is an idempotent matrix, some of the elements of S 
and q can be simplified to 


\[ s_{i,k+1} = \mathrm{trace}(X_i' R X_i), \quad i=1,\ldots,k \]
\[ s_{k+1,j} = \mathrm{trace}(X_j' R X_j), \quad j=1,\ldots,k \]
\[ s_{k+1,k+1} = n - \mathrm{rank}(X_0) \]
\[ q_{k+1} = y' R y \]


Using the algorithm by Goodnight (1978), the elements of S and q are obtained without explicitly 
computing R. The steps are described as follows: 


Step 1. Form the symmetric matrix: 


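\[ \begin{bmatrix}
X_0'X_0 & X_0'X_1 & \cdots & X_0'X_k & X_0'y \\
X_1'X_0 & X_1'X_1 & \cdots & X_1'X_k & X_1'y \\
\vdots & \vdots & & \vdots & \vdots \\
X_k'X_0 & X_k'X_1 & \cdots & X_k'X_k & X_k'y \\
y'X_0 & y'X_1 & \cdots & y'X_k & y'y
\end{bmatrix} \]


Step 2. Sweep the above matrix by pivoting on each diagonal of $X_0'X_0$. This produces the 
following matrix: 


\[ \begin{bmatrix}
G & G X_0'X_1 & \cdots & G X_0'X_k & G X_0'y \\
X_1'X_0 G & X_1'RX_1 & \cdots & X_1'RX_k & X_1'Ry \\
\vdots & \vdots & & \vdots & \vdots \\
X_k'X_0 G & X_k'RX_1 & \cdots & X_k'RX_k & X_k'Ry \\
y'X_0 G & y'RX_1 & \cdots & y'RX_k & y'Ry
\end{bmatrix} \]


where $G = (X_0'X_0)^{-}$. In the process of computing the above matrix, the rank of $X_0$ is obtained 
as the number of nonzero pivots found. 


Step 3. Form S and q. The MINQUE(0) of $\sigma$ are $\hat\sigma = S^{-} q$. 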


MINQUE(1) 


The prior values are $a_i = 1$, i=1,...,k+1. Under this set of prior values, $V = \sum_{i=1}^{k+1} X_i X_i'$. Using 
Giesbrecht (1983), the matrix S and the vector q are obtained through an iterative procedure. 
The steps are described as follows: 


Step 1. Construct the augmented matrix $A = [X_0 | X_1 | \cdots | X_k | y]$. Then compute the 
(m+1)×(m+1) matrix $T_{(k+1)} = A'A$. 


Step 2. Define $H_{(l)} = \sum_{i=l}^{k+1} X_i X_i'$ and $T_{(l)} = A' H_{(l)}^{-1} A$, l=1,...,k. Update $T_{(l+1)}$ to $T_{(l)}$ using the 
W Transform given in Goodnight and Hemmerle (1979). The updating formula is 


\[ T_{(l)} = T_{(l+1)} - A' H_{(l+1)}^{-1} X_l \left( I_{m_l} + X_l' H_{(l+1)}^{-1} X_l \right)^{-1} X_l' H_{(l+1)}^{-1} A \]


Step 3. Once $T_{(1)} = A' H_{(1)}^{-1} A = A' V^{-1} A$ is obtained, apply the Sweep operation to the diagonal 
elements of the upper left $m_0 \times m_0$ submatrix of $T_{(1)}$. The resulting matrix will contain the quadratic 
form $y'Ry$, the vectors $y'RX_i$, i=1,...,k, and the matrices $X_i'RX_j$, i, j=1,...,k. 


Step 4. Compute the elements of S and q. Since RVR = R, then 


\[ \mathrm{SSQ}(RX_i) = \mathrm{tr}(X_i'RX_i) - \sum_{j=1}^{k} \mathrm{SSQ}(X_i'RX_j), \quad i=1,\ldots,k \]


\[ \mathrm{SSQ}(R) = n - \mathrm{rank}(X_0) - \sum_{i=1}^{k} \mathrm{tr}(X_i'RX_i) - \sum_{j=1}^{k} \mathrm{SSQ}(RX_j) \]


\[ \mathrm{SSQ}(Ry) = y'Ry - \sum_{j=1}^{k} \mathrm{SSQ}(y'RX_j) \]


The MINQUE(1) of $\sigma$ are $\hat\sigma = S^{-} q$. 


Maximum Likelihood Estimate (MLE) 


The maximum likelihood estimates are obtained using the algorithm by Jennrich and Sampson 
(1976). The algorithm is an iterative procedure that combines Newton-Raphson steps and Fisher 
scoring steps. 


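Parameters 


The parameter vector is 


\[ \theta = \begin{bmatrix} \beta_0 \\ \gamma^2 \\ \sigma_e^2 \end{bmatrix}
\quad \text{where} \quad
\gamma^2 = \begin{bmatrix} \gamma_1^2 \\ \vdots \\ \gamma_k^2 \end{bmatrix} \]


Likelihood Function 


The likelihood function is 


\[ L = L(\theta) = (2\pi)^{-n/2} \left| \sigma_e^2 V \right|^{-1/2} \exp\left( -\tfrac{1}{2} (y - X_0\beta_0)' V^{-1} (y - X_0\beta_0) / \sigma_e^2 \right) \]


The log-likelihood function is 


\[ \ell = \log L = -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log(\sigma_e^2) - \frac{1}{2} \log|V| - \frac{1}{2\sigma_e^2} (y - X_0\beta_0)' V^{-1} (y - X_0\beta_0) \]


Gradient Vector 


\[ \frac{\partial \ell}{\partial \beta_0} = \frac{1}{\sigma_e^2} X_0' V^{-1} r \]


\[ \frac{\partial \ell}{\partial \gamma_i^2} = \frac{1}{2\sigma_e^2} r' V^{-1} X_i X_i' V^{-1} r - \frac{1}{2} \mathrm{tr}(X_i' V^{-1} X_i), \quad i=1,\ldots,k \]


\[ \frac{\partial \ell}{\partial \sigma_e^2} = \frac{1}{2\sigma_e^4} r' V^{-1} r - \frac{n}{2\sigma_e^2} \]


where $r = y - X_0\beta_0$. The gradient vector is 


\[ \frac{\partial \ell}{\partial \theta} = \begin{bmatrix}
\partial \ell / \partial \beta_0 \\
\partial \ell / \partial \gamma^2 \\
\partial \ell / \partial \sigma_e^2
\end{bmatrix} \]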

Hessian Matrix 


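\[ \frac{\partial^2 \ell}{\partial\beta_0\,\partial\beta_0'} = -\frac{1}{\sigma_e^2} X_0' V^{-1} X_0 \]


\[ \frac{\partial^2 \ell}{\partial\beta_0\,\partial\gamma_i^2} = -\frac{1}{\sigma_e^2} X_0' V^{-1} X_i X_i' V^{-1} r, \quad i=1,\ldots,k \]


\[ \frac{\partial^2 \ell}{\partial\beta_0\,\partial\sigma_e^2} = -\frac{1}{\sigma_e^4} X_0' V^{-1} r \]


\[ \frac{\partial^2 \ell}{\partial\gamma_i^2\,\partial\gamma_j^2} = \frac{1}{2} \mathrm{tr}(X_i' V^{-1} X_j X_j' V^{-1} X_i) - \frac{1}{\sigma_e^2} r' V^{-1} X_i X_i' V^{-1} X_j X_j' V^{-1} r, \quad i=1,\ldots,k,\ j=1,\ldots,k \]


\[ \frac{\partial^2 \ell}{\partial\sigma_e^2\,\partial\gamma_j^2} = -\frac{1}{2\sigma_e^4} r' V^{-1} X_j X_j' V^{-1} r, \quad j=1,\ldots,k \]


\[ \frac{\partial^2 \ell}{\partial\sigma_e^2\,\partial\sigma_e^2} = \frac{n}{2\sigma_e^4} - \frac{1}{\sigma_e^6} r' V^{-1} r \]


The Hessian matrix is 


\[ \frac{\partial^2 \ell}{\partial\theta\,\partial\theta'} = \begin{bmatrix}
\dfrac{\partial^2 \ell}{\partial\beta_0\,\partial\beta_0'} & \dfrac{\partial^2 \ell}{\partial\beta_0\,\partial\gamma^{2\prime}} & \dfrac{\partial^2 \ell}{\partial\beta_0\,\partial\sigma_e^2} \\
\dfrac{\partial^2 \ell}{\partial\gamma^2\,\partial\beta_0'} & \dfrac{\partial^2 \ell}{\partial\gamma^2\,\partial\gamma^{2\prime}} & \dfrac{\partial^2 \ell}{\partial\gamma^2\,\partial\sigma_e^2} \\
\dfrac{\partial^2 \ell}{\partial\sigma_e^2\,\partial\beta_0'} & \dfrac{\partial^2 \ell}{\partial\sigma_e^2\,\partial\gamma^{2\prime}} & \dfrac{\partial^2 \ell}{\partial\sigma_e^2\,\partial\sigma_e^2}
\end{bmatrix} \]


where 


\[ \frac{\partial^2 \ell}{\partial\beta_0\,\partial\gamma^{2\prime}} = \begin{bmatrix} \dfrac{\partial^2 \ell}{\partial\beta_0\,\partial\gamma_1^2} & \cdots & \dfrac{\partial^2 \ell}{\partial\beta_0\,\partial\gamma_k^2} \end{bmatrix}, \qquad
\frac{\partial^2 \ell}{\partial\gamma^2\,\partial\gamma^{2\prime}} = \begin{bmatrix}
\dfrac{\partial^2 \ell}{\partial\gamma_1^2\,\partial\gamma_1^2} & \cdots & \dfrac{\partial^2 \ell}{\partial\gamma_1^2\,\partial\gamma_k^2} \\
\vdots & & \vdots \\
\dfrac{\partial^2 \ell}{\partial\gamma_k^2\,\partial\gamma_1^2} & \cdots & \dfrac{\partial^2 \ell}{\partial\gamma_k^2\,\partial\gamma_k^2}
\end{bmatrix} \]


and 


\[ \frac{\partial^2 \ell}{\partial\gamma^2\,\partial\sigma_e^2} = \begin{bmatrix} \dfrac{\partial^2 \ell}{\partial\sigma_e^2\,\partial\gamma_1^2} \\ \vdots \\ \dfrac{\partial^2 \ell}{\partial\sigma_e^2\,\partial\gamma_k^2} \end{bmatrix} \]


Fisher Information Matrix 


As $E(r) = 0_n$ and $E(r' V^{-1} r) = n\sigma_e^2$, the expected second derivatives are 


\[ E\left( \frac{\partial^2 \ell}{\partial\beta_0\,\partial\gamma_i^2} \right) = 0_{m_0}, \quad i=1,\ldots,k \]


\[ E\left( \frac{\partial^2 \ell}{\partial\gamma_i^2\,\partial\gamma_j^2} \right) = -\frac{1}{2} \mathrm{tr}(X_i' V^{-1} X_j X_j' V^{-1} X_i), \quad i=1,\ldots,k,\ j=1,\ldots,k \]


\[ E\left( \frac{\partial^2 \ell}{\partial\beta_0\,\partial\sigma_e^2} \right) = 0_{m_0} \]


\[ E\left( \frac{\partial^2 \ell}{\partial\sigma_e^2\,\partial\gamma_j^2} \right) = -\frac{1}{2\sigma_e^2} \mathrm{tr}(X_j' V^{-1} X_j), \quad j=1,\ldots,k \]


\[ E\left( \frac{\partial^2 \ell}{\partial\sigma_e^2\,\partial\sigma_e^2} \right) = -\frac{n}{2\sigma_e^4} \]


The Fisher Information matrix is 


\[ -E\left( \frac{\partial^2 \ell}{\partial\theta\,\partial\theta'} \right) = -\begin{bmatrix}
\dfrac{\partial^2 \ell}{\partial\beta_0\,\partial\beta_0'} & 0_{m_0 \times k} & 0_{m_0} \\
0_{k \times m_0} & E\left( \dfrac{\partial^2 \ell}{\partial\gamma^2\,\partial\gamma^{2\prime}} \right) & E\left( \dfrac{\partial^2 \ell}{\partial\gamma^2\,\partial\sigma_e^2} \right) \\
0_{m_0}' & E\left( \dfrac{\partial^2 \ell}{\partial\sigma_e^2\,\partial\gamma^{2\prime}} \right) & E\left( \dfrac{\partial^2 \ell}{\partial\sigma_e^2\,\partial\sigma_e^2} \right)
\end{bmatrix} \]


where $E(\partial^2 \ell / \partial\gamma^2\,\partial\gamma^{2\prime})$ is the k×k matrix with (i, j)th element $E(\partial^2 \ell / \partial\gamma_i^2\,\partial\gamma_j^2)$ 
and $E(\partial^2 \ell / \partial\gamma^2\,\partial\sigma_e^2)$ is the length k vector with jth element $E(\partial^2 \ell / \partial\sigma_e^2\,\partial\gamma_j^2)$. 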


Iteration Procedure 


The iterative estimation algorithm proceeds according to the details in the following sections. 


Initial Values 
■ Fixed Effect Parameters: $\hat\beta_0 = (X_0'X_0)^{-} X_0' y$ 
■ Random Effect Variance Components: For the ith random effect, compute 
$\hat\beta_i = (X_i'X_i)^{-} X_i' y$. Then assign the variance of the $m_i$ elements of $\hat\beta_i$ using divisor 
$(m_i - 1)$ to the estimate $\hat\sigma_i^2$ if $m_i \ge 2$; otherwise $\hat\sigma_i^2 = 0$. 
■ Residual Variance: $\hat\sigma_e^2 = r'r/n$ where $X = [X_0 | X_1 | \cdots | X_k]$ and $r = y - X(X'X)^{-} X' y$. If 
$\hat\sigma_e^2 = 0$ but $k \ge 1$ then reset $\hat\sigma_e^2 = 10^{-8}$ so that the iteration can continue. 


The variance ratios are then computed as $\hat\gamma_i^2 = \hat\sigma_i^2 / \hat\sigma_e^2$. Following the same method in which the 
residual variance is initialized, $\hat\sigma_e^2 > 0$ for $k \ge 1$. 


Updating 
At the sth iteration, s=0,1,..., the parameter vector is updated as 


\[ \theta_{(s+1)} = \theta_{(s)} + \rho \, \Delta\theta_{(s)} \]


where $\Delta\theta_{(s)}$ is the value of the increment $\Delta\theta$ evaluated at $\theta = \theta_{(s)}$, and $\rho > 0$ is a step size such that 
$\ell(\theta_{(s+1)}) > \ell(\theta_{(s)})$. The increment vector depends on the choice of step type (Newton-Raphson 
versus Fisher scoring). The step size is determined by the step-halving technique with $\rho = 1$ initially 
and a maximum of 10 halvings. 
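
A minimal sketch of the step-halving rule follows. The function name, the list representation of the parameter vector, and the choice to return `None` when no improving step is found within the allowed halvings are assumptions of this example; the text does not say what the procedure does in that case.

```python
def step_halving(theta, delta, loglik, max_halvings=10):
    """Try rho = 1, 1/2, 1/4, ... until the log-likelihood increases.

    theta, delta: lists of floats (current parameters and increment vector)
    loglik: callable returning the log-likelihood at a parameter vector
    """
    rho = 1.0
    base = loglik(theta)
    for _ in range(max_halvings + 1):      # the full step plus up to 10 halvings
        candidate = [t + rho * d for t, d in zip(theta, delta)]
        if loglik(candidate) > base:
            return candidate, rho
        rho /= 2.0
    return None, 0.0                        # assumed failure signal


# Toy concave objective with its maximum at 1; the full step overshoots.
toy_loglik = lambda t: -(t[0] - 1.0) ** 2
new_theta, rho = step_halving([0.0], [4.0], toy_loglik)
```

Here the full step and the first halving do not improve the objective, so the step size settles at 0.25.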


Choice of Step 


Following Jennrich and Sampson (1976), the first iteration is always the Fisher scoring step 
because it is more robust to poor initial values. For subsequent iterations the Newton-Raphson 
step is used if: 


1. The Hessian matrix is nonnegative definite, and 


2. The increment in the log-likelihood function of step 1 is less than or equal to one. 


Otherwise the Fisher scoring step is used. The increment vector for each type of step is: 


■ Newton-Raphson Step: $\Delta\theta = \left( -\dfrac{\partial^2 \ell}{\partial\theta\,\partial\theta'} \right)^{-1} \dfrac{\partial \ell}{\partial\theta}$ 


■ Fisher Scoring Step: $\Delta\theta = \left( -E\left( \dfrac{\partial^2 \ell}{\partial\theta\,\partial\theta'} \right) \right)^{-1} \dfrac{\partial \ell}{\partial\theta}$ 


Convergence Criteria 


Given the convergence criterion « > 0, the iteration is considered converged when the following 
criteria are satisfied: 


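1. $\left| \ell(\theta_{(s+1)}) - \ell(\theta_{(s)}) \right| < \epsilon \times \max\left(1, \left| \ell(\theta_{(s)}) \right| \right)$, and 


2. $\langle \theta_{(s+1)} - \theta_{(s)} \rangle < \epsilon \times \max\left(1, \langle \theta_{(s)} \rangle \right)$, where $\langle a \rangle$ is the sum of absolute values of 
elements of the vector a. 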
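
The two criteria can be checked together as in the sketch below. The function name and the default value of `eps` are hypothetical; the procedure's actual default is not stated here.

```python
def converged(ll_new, ll_old, theta_new, theta_old, eps=1e-8):
    """Return True when both the log-likelihood and parameter criteria hold."""
    # Criterion 1: change in log-likelihood relative to max(1, |l(theta_s)|)
    ll_ok = abs(ll_new - ll_old) < eps * max(1.0, abs(ll_old))
    # Criterion 2: sum of absolute parameter changes relative to max(1, <theta_s>)
    step = sum(abs(a - b) for a, b in zip(theta_new, theta_old))
    size = sum(abs(b) for b in theta_old)
    theta_ok = step < eps * max(1.0, size)
    return ll_ok and theta_ok
```

Both tests must pass, so a flat log-likelihood with parameters still moving is not treated as converged.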


Negative Variance Estimates 


Negative variance estimates can occur at the end of an iteration. An ad hoc method is to set 
those estimates to zero before the next iteration. 


Covariance Matrix 


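Let $\hat\theta$ be the vector of maximum likelihood estimates. Their covariance matrix is given by 


\[ \mathrm{cov}(\hat\theta) = \left( -E\left( \frac{\partial^2 \ell}{\partial\theta\,\partial\theta'} \right) \Big|_{\theta = \hat\theta} \right)^{-1} \]


Let 


\[ \psi = \begin{bmatrix} \beta_0 \\ \sigma_1^2 \\ \vdots \\ \sigma_k^2 \\ \sigma_e^2 \end{bmatrix} \]


be the original parameters. Their maximum likelihood estimates are given by 


\[ \hat\psi = \begin{bmatrix} \hat\beta_0 \\ \hat\sigma_e^2 \hat\gamma_1^2 \\ \vdots \\ \hat\sigma_e^2 \hat\gamma_k^2 \\ \hat\sigma_e^2 \end{bmatrix} \]


and their covariance matrix is estimated by 


\[ \mathrm{cov}(\hat\psi) = J \, \mathrm{cov}(\hat\theta) \, J' \]


where 


\[ J = \begin{bmatrix}
I_{m_0} & 0_{m_0 \times k} & 0_{m_0} \\
0_{k \times m_0} & \sigma_e^2 I_k & \gamma^2 \\
0_{m_0}' & 0_k' & 1
\end{bmatrix} \]


which is the $(m_0 + k + 1) \times (m_0 + k + 1)$ Jacobian matrix of transforming $\theta$ to $\psi$. 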


Restricted Maximum Likelihood Estimate (REML) 


The restricted maximum likelihood method finds a linear transformation on y such that the 
resulting vector does not involve the fixed effect parameter vector $\beta_0$ regardless of its value. It 
has been shown that these linear combinations are the residuals obtained after a linear regression 
on the fixed effects. Suppose r is the rank of $X_0$; then there are at most n − r linearly independent 
combinations. Let K be an n × (n − r) matrix whose columns are these linearly independent 
combinations. Then the properties of K are (Searle et al., 1992, Chapter 6): 


\[ K' X_0 = 0_{(n-r) \times m_0} \]
\[ K' = TM \]


where T is an (n − r) × n matrix with linearly independent rows and 
$M = I_n - X_0 (X_0'X_0)^{-} X_0'$. 


It can be shown that REML estimation is invariant to K (Searle et al., 1992, Chapter 6); thus, we 
can choose K such that $K'K = I_{n-r}$ to simplify calculations. It follows that the distribution of 
$K'y$ is $N_{n-r}(0, \sigma_e^2 K'VK)$. 


Parameters 


The parameter vector is 


\[ \theta = \begin{bmatrix} \gamma^2 \\ \sigma_e^2 \end{bmatrix}
\quad \text{where} \quad
\gamma^2 = \begin{bmatrix} \gamma_1^2 \\ \vdots \\ \gamma_k^2 \end{bmatrix} \]


Likelihood Function 


The likelihood function of $K'y$ is 


\[ L = L(\theta) = (2\pi)^{-(n-r)/2} \left| \sigma_e^2 K'VK \right|^{-1/2} \exp\left( -\tfrac{1}{2} y' K (K'VK)^{-1} K' y / \sigma_e^2 \right) \]


It can be shown (Searle et al., 1992) that 


\[ R = V^{-1} - V^{-1} X_0 (X_0' V^{-1} X_0)^{-} X_0' V^{-1} = K (K'VK)^{-1} K' \]


Thus, the log-likelihood function is 


\[ \ell = \log L = -\frac{n-r}{2} \log(2\pi) - \frac{n-r}{2} \log(\sigma_e^2) - \frac{1}{2} \log\left| K'VK \right| - \frac{1}{2\sigma_e^2} y' R y \]


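Gradient Vector 


\[ \frac{\partial \ell}{\partial \gamma_i^2} = \frac{1}{2\sigma_e^2} y' R X_i X_i' R y - \frac{1}{2} \mathrm{tr}(X_i' R X_i), \quad i=1,\ldots,k \]


\[ \frac{\partial \ell}{\partial \sigma_e^2} = \frac{1}{2\sigma_e^4} y' R y - \frac{n-r}{2\sigma_e^2} \]


The gradient vector is 


\[ \frac{\partial \ell}{\partial \theta} = \begin{bmatrix}
\partial \ell / \partial \gamma^2 \\
\partial \ell / \partial \sigma_e^2
\end{bmatrix} \]


Hessian Matrix 


\[ \frac{\partial^2 \ell}{\partial\gamma_i^2\,\partial\gamma_j^2} = -\frac{1}{\sigma_e^2} y' R X_i X_i' R X_j X_j' R y + \frac{1}{2} \mathrm{tr}(X_i' R X_j X_j' R X_i), \quad i=1,\ldots,k,\ j=1,\ldots,k \]


\[ \frac{\partial^2 \ell}{\partial\sigma_e^2\,\partial\gamma_j^2} = -\frac{1}{2\sigma_e^4} y' R X_j X_j' R y, \quad j=1,\ldots,k \]


\[ \frac{\partial^2 \ell}{\partial\sigma_e^2\,\partial\sigma_e^2} = -\frac{1}{\sigma_e^6} y' R y + \frac{n-r}{2\sigma_e^4} \]


Fisher Information Matrix 


Since $K'X_0 = 0_{(n-r) \times m_0}$ and trace(RV) = n − r, the expected second derivatives are 


\[ E\left( \frac{\partial^2 \ell}{\partial\gamma_i^2\,\partial\gamma_j^2} \right) = -\frac{1}{2} \mathrm{tr}(X_i' R X_j X_j' R X_i), \quad i=1,\ldots,k,\ j=1,\ldots,k \]


\[ E\left( \frac{\partial^2 \ell}{\partial\sigma_e^2\,\partial\gamma_j^2} \right) = -\frac{1}{2\sigma_e^2} \mathrm{tr}(X_j' R X_j), \quad j=1,\ldots,k \]


\[ E\left( \frac{\partial^2 \ell}{\partial\sigma_e^2\,\partial\sigma_e^2} \right) = -\frac{n-r}{2\sigma_e^4} \]


The Fisher Information matrix is 


\[ -E\left( \frac{\partial^2 \ell}{\partial\theta\,\partial\theta'} \right) = -\begin{bmatrix}
E\left( \dfrac{\partial^2 \ell}{\partial\gamma^2\,\partial\gamma^{2\prime}} \right) & E\left( \dfrac{\partial^2 \ell}{\partial\gamma^2\,\partial\sigma_e^2} \right) \\
E\left( \dfrac{\partial^2 \ell}{\partial\sigma_e^2\,\partial\gamma^{2\prime}} \right) & E\left( \dfrac{\partial^2 \ell}{\partial\sigma_e^2\,\partial\sigma_e^2} \right)
\end{bmatrix} \]


where $E(\partial^2 \ell / \partial\gamma^2\,\partial\gamma^{2\prime})$ is the k×k matrix with (i, j)th element $E(\partial^2 \ell / \partial\gamma_i^2\,\partial\gamma_j^2)$ 
and $E(\partial^2 \ell / \partial\gamma^2\,\partial\sigma_e^2)$ is the length k vector with jth element $E(\partial^2 \ell / \partial\sigma_e^2\,\partial\gamma_j^2)$. 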


Iteration Procedure 


The iterative estimation algorithm proceeds according to the details in the following sections. 


Initial Values 
■ Random Effect Variance Components: For the ith random effect, compute 
$\hat\beta_i = (X_i'X_i)^{-} X_i' y$. Then assign the variance of the $m_i$ elements of $\hat\beta_i$ using divisor 
$(m_i - 1)$ to the estimate $\hat\sigma_i^2$ if $m_i \ge 2$; otherwise $\hat\sigma_i^2 = 0$. 
■ Residual Variance: $\hat\sigma_e^2 = r'r/n$ where $X = [X_0 | X_1 | \cdots | X_k]$ and $r = y - X(X'X)^{-} X' y$. If 
$\hat\sigma_e^2 = 0$ but $k \ge 1$ then reset $\hat\sigma_e^2 = 10^{-8}$ so that the iteration can continue. 


The variance ratios are then computed as $\hat\gamma_i^2 = \hat\sigma_i^2 / \hat\sigma_e^2$. Following the same method in which the 
residual variance is initialized, $\hat\sigma_e^2 > 0$ for $k \ge 1$. 


Updating 
At the sth iteration, s=0,1,..., the parameter vector is updated as 


\[ \theta_{(s+1)} = \theta_{(s)} + \rho \, \Delta\theta_{(s)} \]


where $\Delta\theta_{(s)}$ is the value of the increment $\Delta\theta$ evaluated at $\theta = \theta_{(s)}$, and $\rho > 0$ is a step size such that 
$\ell(\theta_{(s+1)}) > \ell(\theta_{(s)})$. The increment vector depends on the choice of step type (Newton-Raphson 
versus Fisher scoring). The step size is determined by the step-halving technique with $\rho = 1$ initially 
and a maximum of 10 halvings. 


Choice of Step 


Following Jennrich and Sampson (1976), the first iteration is always the Fisher scoring step 
because it is more robust to poor initial values. For subsequent iterations the Newton-Raphson 
step is used if: 


1. The Hessian matrix is nonnegative definite, and 


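2. The increment in the log-likelihood function of step 1 is less than or equal to one. 


Otherwise the Fisher scoring step is used. The increment vector for each type of step is: 


■ Newton-Raphson Step: $\Delta\theta = \left( -\dfrac{\partial^2 \ell}{\partial\theta\,\partial\theta'} \right)^{-1} \dfrac{\partial \ell}{\partial\theta}$ 


■ Fisher Scoring Step: $\Delta\theta = \left( -E\left( \dfrac{\partial^2 \ell}{\partial\theta\,\partial\theta'} \right) \right)^{-1} \dfrac{\partial \ell}{\partial\theta}$ 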


Convergence Criteria 


Given the convergence criterion $\epsilon > 0$, the iteration is considered converged when the following 
criteria are satisfied: 


1. $\left| \ell(\theta_{(s+1)}) - \ell(\theta_{(s)}) \right| < \epsilon \times \max\left(1, \left| \ell(\theta_{(s)}) \right| \right)$, and 


2. $\langle \theta_{(s+1)} - \theta_{(s)} \rangle < \epsilon \times \max\left(1, \langle \theta_{(s)} \rangle \right)$, where $\langle a \rangle$ is the sum of absolute values of 
elements of the vector a. 


Negative Variance Estimates 


Negative variance estimates can occur at the end of an iteration. An ad hoc method is to set 
those estimates to zero before the next iteration. 


Covariance Matrix 


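Let $\hat\theta$ be the vector of restricted maximum likelihood estimates. Their covariance matrix is given by 


\[ \mathrm{cov}(\hat\theta) = \left( -E\left( \frac{\partial^2 \ell}{\partial\theta\,\partial\theta'} \right) \Big|_{\theta = \hat\theta} \right)^{-1} \]


Let 


\[ \psi = \begin{bmatrix} \sigma_1^2 \\ \vdots \\ \sigma_k^2 \\ \sigma_e^2 \end{bmatrix} \]


be the original parameters. Their restricted maximum likelihood estimates are given by 


\[ \hat\psi = \begin{bmatrix} \hat\sigma_e^2 \hat\gamma_1^2 \\ \vdots \\ \hat\sigma_e^2 \hat\gamma_k^2 \\ \hat\sigma_e^2 \end{bmatrix} \]


and their covariance matrix is estimated by 


\[ \mathrm{cov}(\hat\psi) = J \, \mathrm{cov}(\hat\theta) \, J' \]


where 


\[ J = \begin{bmatrix}
\sigma_e^2 I_k & \gamma^2 \\
0_k' & 1
\end{bmatrix} \]


which is the (k + 1) × (k + 1) Jacobian matrix of transforming $\theta$ to $\psi$. 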


ANOVA 


The ANOVA variance component estimates are obtained by equating the expected mean squares 
of the random effects to their observed mean squares. The VARCOMP procedure offers two types 
of sum of squares: Type I and Type III (see Appendix 11 for details). 


Let 


\[ \psi = \begin{bmatrix} \sigma_1^2 \\ \vdots \\ \sigma_k^2 \\ \sigma_e^2 \end{bmatrix} \]


be the vector of variance components. 


Let 


\[ q = \begin{bmatrix} MS_1 \\ \vdots \\ MS_k \\ MSE \end{bmatrix} \]


where $MS_i$ is the observed mean squares of the ith random effect, and MSE is the residual 
mean squares. 


Let 


\[ S = \begin{bmatrix} s_1' \\ \vdots \\ s_k' \\ s_{k+1}' \end{bmatrix} \]


be a (k + 1) × (k + 1) matrix whose rows are coefficients for the expected mean squares. For 
example, the expected mean squares of the ith random effect is $s_i' \psi$. Algorithms for computing 
the expected mean squares can be found in the section “Univariate Mixed Model” in the chapter 
GLM Univariate and Multivariate. The ANOVA variance component estimates are then obtained 
by solving the system of linear equations: 


\[ S \psi = q \]


WLS Algorithms 


WLS estimates a regression model with different weights for different cases. 


Notation 


The following notation is used throughout this chapter unless otherwise stated: 


$n$    The number of cases 

$p$    The number of parameters for the model 

$y$    n×1 vector with element $y_i$, which represents the observed dependent 
    variable for case i 

$X$    n×p matrix with element $x_{ij}$, which represents the observed value of the ith 
    case of the jth independent variable 

$\beta$    p×1 vector with element $\beta_j$, which represents the regression coefficient of 
    the jth independent variable 

$w$    n×1 vector with element $w_i$, which represents the weight for case i 


Model 
The linear regression model has the form of 


\[ y_i = x_i' \beta + \epsilon_i, \quad i=1,\ldots,n \]


where $x_i$ is the vector of covariates for the ith case, $E(\epsilon_i) = 0$, and $\mathrm{var}(\epsilon_i) = w_i^{-1} \sigma^2$. Assuming 
that $\epsilon_1, \ldots, \epsilon_n$ follow a normal distribution, the log-likelihood function is 


\[ L = 0.5 \left\{ -n \ln 2\pi - n \ln \sigma^2 + \sum_{i=1}^{n} \ln w_i - \frac{\sum_{i=1}^{n} w_i (y_i - x_i'\beta)^2}{\sigma^2} \right\} \]


Computational Details 
The algorithm used to obtain the weighted least-square estimates for the parameters in the model 
is the same as the REGRESSION procedure with regression weight. For details of the algorithm 
and statistics (the ANOVA table and the variables in the equation), see REGRESSION. 
After the estimation is finished, the log-likelihood function is estimated by 


\[ \hat L = 0.5 \left\{ -n \ln 2\pi - n \ln \hat\sigma^2 + \sum_{i=1}^{n} \ln w_i - (n - p) \right\} \]


where $\hat\sigma^2$ is the mean square error in the ANOVA table. 
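
The estimated log-likelihood is a direct function of the weights, the number of parameters, and the mean square error, as the following sketch shows. The function and argument names are illustrative only.

```python
import math


def wls_loglik(weights, n_params, mse):
    """0.5 * { -n ln(2 pi) - n ln(mse) + sum(ln w_i) - (n - p) }."""
    n = len(weights)
    return 0.5 * (-n * math.log(2.0 * math.pi)
                  - n * math.log(mse)
                  + sum(math.log(w) for w in weights)
                  - (n - n_params))


# Four cases with unit weights, one parameter, mean square error 1.0
value = wls_loglik([1.0, 1.0, 1.0, 1.0], 1, 1.0)
```

The last term is n − p because the weighted residual sum of squares divided by the mean square error equals the residual degrees of freedom.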


Appendix 


Significance Level of a Standard Normal Deviate 


The significance level is based on a polynomial approximation. 


Notation 
The following notation is used throughout this section unless otherwise stated: 
Table A-1 
Notation 
Notation    Description 
X           Value of the standard normal deviate 
Q           One-sided significance level 

Computation 


\[ Q(X) = 0.5 \left\{ 1 + Z(a_1 + Z(a_2 + Z(a_3 + Z(a_4 + Z(a_5 + Z a_6))))) \right\}^{-16} \]


where 


\[ Z = \begin{cases} 0.7071067812\,|X| & \text{if } |X| < 14.14 \\ 10 & \text{otherwise} \end{cases} \]


a_1 = 0.070523078...    a_4 = 0.0001520143 
a_2 = 0.0422820123      a_5 = 0.0002765672 
a_3 = 0.0092705272      a_6 = 0.0000430638 
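
The nested polynomial can be evaluated with Horner's scheme, as in this sketch. The first coefficient is truncated in the text above; the value 0.0705230784 used here is an assumption, taken from the standard six-term rational approximation to the error function, and the function name is illustrative.

```python
import math

# a_1 is an assumed completion of the truncated value; a_2..a_6 are as listed.
COEFFS = (0.0705230784, 0.0422820123, 0.0092705272,
          0.0001520143, 0.0002765672, 0.0000430638)


def one_sided_significance(x):
    """One-sided significance level Q(X) of a standard normal deviate."""
    z = 0.7071067812 * abs(x) if abs(x) < 14.14 else 10.0
    poly = 0.0
    for a in reversed(COEFFS):       # builds Z(a1 + Z(a2 + ... + Z*a6))
        poly = z * (a + poly)
    return 0.5 * (1.0 + poly) ** -16


q = one_sided_significance(1.96)
```

For a deviate of 1.96 the result is close to 0.025, in line with the exact tail area computed from the complementary error function.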
Appendix 


Significance Levels for Fisher’s Exact Test 


The procedure described in this appendix is used to calculate the exact one-tailed and two-tailed 
significance levels of Fisher’s exact test for a 2x2 table under the assumption of independence 
of rows and columns and conditional on the marginal totals. All cell counts are rounded to the 
nearest integers. 


Background 


Consider the following observed 2×2 table: 


Table B-1 
2 × 2 table 
                Column 1    Column 2    Row total 
Row 1           n1          n2          n1+n2 
Row 2           n3          n4          n3+n4 
Column total    n1+n3       n2+n4       N 


Conditional on the observed marginal totals, the values of the four cell counts can be expressed as 
the observed count of the first cell $n_1$ only. Under the hypothesis of independence, the count of 
the first cell $N_1$ follows a hypergeometric distribution with the probability of $N_1 = n_1$ given by 


\[ \mathrm{Prob}(N_1 = n_1) = \frac{(n_1+n_2)!\,(n_3+n_4)!\,(n_1+n_3)!\,(n_2+n_4)!}{N!\,n_1!\,n_2!\,n_3!\,n_4!} \]


where $N_1$ ranges from $\max(0, n_1 - n_4)$ to $\min(n_1+n_2, n_1+n_3)$ and $N = n_1 + n_2 + n_3 + n_4$. 
The exact one-tailed significance level $p_1$ is defined as 


\[ p_1 = \begin{cases}
\mathrm{Prob}(N_1 \ge n_1) & \text{if } n_1 > E(N_1) \\
\mathrm{Prob}(N_1 \le n_1) & \text{if } n_1 \le E(N_1)
\end{cases} \]


where $E(N_1) = (n_1+n_2)(n_1+n_3)/N$. 


The exact two-tailed significance level $p_2$ is defined as the sum of the one-tailed significance level 
$p_1$ and the probabilities of all points in the other side of the sample space of $N_1$ which are not 
greater than the probability of $N_1 = n_1$. 
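
The definitions of the one-tailed and two-tailed levels can be evaluated directly from the hypergeometric probabilities, as sketched below. This follows the definitions rather than the CDF.HYPER-based procedure, and the function name is illustrative. Probabilities are compared through their integer numerators so that ties are exact.

```python
from math import comb


def fisher_exact_2x2(n1, n2, n3, n4):
    """Return (one-tailed, two-tailed) significance levels for a 2x2 table."""
    row1, col1, total = n1 + n2, n1 + n3, n1 + n2 + n3 + n4
    if total == 0:
        return 1.0, 1.0
    lo, hi = max(0, n1 - n4), min(row1, col1)

    def weight(x):
        # Numerator of Prob(N1 = x); the common denominator is comb(total, col1)
        return comb(row1, x) * comb(total - row1, col1 - x)

    denom = comb(total, col1)
    if n1 * n4 > n2 * n3:                      # equivalent to n1 > E(N1)
        tail, other = range(n1, hi + 1), range(lo, n1)
    else:
        tail, other = range(lo, n1 + 1), range(n1 + 1, hi + 1)
    w_obs = weight(n1)
    w_tail = sum(weight(x) for x in tail)
    w_other = sum(weight(x) for x in other if weight(x) <= w_obs)
    return w_tail / denom, (w_tail + w_other) / denom


p1, p2 = fisher_exact_2x2(3, 1, 1, 3)
```

For the table with cells 3, 1, 1, 3 the one-tailed level is 17/70 and the two-tailed level is 34/70.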


Computations 


To begin the computation of the two significance levels $p_1$ and $p_2$, the counts in the observed 2×2 
table are rearranged. Then the exact one-tailed and two-tailed significance levels are computed 
using the CDF.HYPER cumulative distribution function. 


Table Rearrangement 


The following steps are used to rearrange the table: 


1. Check whether $n_1 > E(N_1)$, which can be done by checking whether $n_1 n_4 > n_2 n_3$. If so, 
rearrange the table so that the first cell contains the minimum of $n_2$ and $n_3$, maintaining the row 
and column totals; otherwise, rearrange the table so that the first cell contains the minimum of 
$n_1$ and $n_4$, again maintaining the row and column totals. 


2. Without loss of generality, we assume that the count of the first cell is $n_1$ after the above 
rearrangement. Calculate the first row total, the first column total, and the overall total, and name 
them SAMPLE, HITS, and TOTAL, respectively. 


One-Tailed Significance Level 
The following steps are used to calculate the one-tailed significance level: 


1. If TOTAL=0, set the one-tailed significance level $p_1$ equal to 1; otherwise, obtain $p_1$ by using the 
CDF.HYPER cumulative distribution function with arguments $n_1$, SAMPLE, HITS, and TOTAL. 


2. Also calculate the probability of the first cell count equal to $n_1$ by finding the difference between 
$p_1$ and the value obtained from CDF.HYPER with $n_1 - 1$, SAMPLE, HITS, and TOTAL as its 
arguments, provided that $n_1 > 0$. Call this probability PEXACT. 


3. If $n_1 = 0$, set PEXACT = $p_1$. PEXACT will be used in the next step to find the points for which the 
probabilities are not greater than PEXACT. 


Two-Tailed Significance Level 


The following steps are used to calculate the two-tailed significance level: 


1. If TOTAL=0, set the two-tailed significance level $p_2$ equal to 1; otherwise, start searching 
backwards from $\min(n_1+n_2, n_1+n_3)$ to $(n_1+1)$, and find the first point x with its point probability 
greater than PEXACT. (Notice that this backward search takes advantage of the unimodal property 
of the hypergeometric distribution.) 


2. If such an x exists between $\min(n_1+n_2, n_1+n_3)$ and $(n_1+1)$, calculate the probability value obtained 
from CDF.HYPER with arguments x, SAMPLE, HITS, and TOTAL. Call this probability $p_x$. 


3. The two-tailed significance level $p_2$ is obtained by finding the sum of $p_1$ and $(1 - p_x)$. If no 
qualified x exists, the two-tailed significance level is equal to 1. 
C 


Sorting and Searching 


Sorting and searching have a significant impact on the performance of a number of procedures. 
For those procedures, the methods used are identified here. 


CROSSTABS 


In the general mode, the table of cells is searched using an unordered open scatter table search 
and insertion algorithm similar to Knuth’s Algorithm L (Knuth, 1973, p. 518). The scatter table 
contains only pointers to the actual cell contents and is twice as large as it need be (that is, if 
there is room for m cells, the scatter table has room for 2m pointers). This means it can never be 
more than half full. Collisions are resolved by sequential search from the initial location until 
an empty pointer is found. 


Letting 

k be the table number 

p be the dimension of the table 

v(i), i=1,...,p, be the bit string used to represent the value of the ith variable defining table k 
m be the length of the scatter table 

n be the resulting hash value, to be used as an index in the scatter table 


The hash function used is given by the following algorithm: 
j:=k 
for i:=1 to p 
    j:=j rotated left 3 bits 
    j:=j EXCLUSIVE OR v(i) 
end 
n:=(j modulo m)+1 
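
The hash loop above translates directly into the following sketch. The 32-bit word size used for the rotation is an assumption (the text does not state the width of j), and the function and parameter names are illustrative.

```python
def crosstabs_hash(table_number, values, table_len, word_bits=32):
    """Rotate-and-XOR hash returning a 1-based index into the scatter table.

    word_bits is an assumed word size for the 3-bit left rotation.
    """
    mask = (1 << word_bits) - 1
    j = table_number & mask
    for v in values:
        # Rotate j left by 3 bits within the assumed word size
        j = ((j << 3) | (j >> (word_bits - 3))) & mask
        j ^= v & mask
    return (j % table_len) + 1


index = crosstabs_hash(1, [5], 7)
```

With table number 1, a single value of 5, and a scatter table of length 7, j becomes 8 after the rotation and 13 after the exclusive OR, giving index 7.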


When the tables have been completed, the cells are sorted by table numbers and the values of the 
defining variables using the algorithm described by Singleton (1969). 


FREQUENCIES 


FREQUENCIES uses the same search and sort algorithms as CROSSTABS, except that its 
hashing function is given by: 


$h = (((k + 16807\,v) \bmod 2^{31}) \bmod m) + 1$ 


where 
h is the hash value, to be used as an index in the scatter table 
k is the table number 
v is the integer value of the bits representing the value to be tabulated 


m is the length of the scatter table 
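
As a minimal sketch, the FREQUENCIES hash can be written as a one-line function. The name `frequencies_hash` is hypothetical, and the modulus is read as 2^31.

```python
def frequencies_hash(k, v, m):
    """Return h in 1..m for table number k and integer value v."""
    return (((k + 16807 * v) % 2**31) % m) + 1
```

For example, `frequencies_hash(1, 2, 10)` evaluates (1 + 33614) modulo 10 plus 1.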


NONPAR CORR and NPAR TESTS 


Both use the method of Singleton to sort cases for computing ranks. 


SURVIVAL 
SURVIVAL uses a modified Quicksort similar to Knuth’s algorithm Q (Knuth, 1973, p. 116) to 
sort cases. 
Generation of Uniform Random Numbers 


Two different random number generators are available: 


■ Version 12 Compatible. The random number generator used in version 12 and previous 
releases. If you need to reproduce randomized results generated in previous releases based on 
a specified seed value, use this random number generator. 

■ Mersenne Twister. A newer random number generator that is more reliable for simulation 
purposes. If reproducing randomized results from version 12 or earlier is not an issue, use this 
random number generator. 


Specifically, the Mersenne Twister has a far longer period (number of draws before it repeats) 
and far higher order of equidistribution (its results are “more uniform”) than the IBM® SPSS® 
Statistics 12 Compatible generator. The Mersenne Twister is also very fast and uses memory 
efficiently. 


IBM SPSS Statistics 12 Compatible Random Number Generator 


Uniform numbers are generated using the algorithm of Fishman and Moore (1981). It is a 
multiplicative congruential generator that is simply stated as: 


seed(t+1) = (a * seed(t)) modulo p 
rand = seed(t+1) / (p+1) 


where a = 397204094 and $p = 2^{31} - 1 = 2147483647$, which is also its period. Seed(t) is a 32-bit 
integer that can be displayed using SHOW SEED. SET SEED=number sets seed(t) to the specified 
number, truncated to an integer. SET SEED=RANDOM sets seed(t) to the current time of day in 
milliseconds since midnight. 
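
The two update equations translate directly into code. The sketch below uses the constants stated above; the function names are hypothetical and seed handling (SET SEED) is not modeled.

```python
A = 397204094      # multiplier a
P = 2**31 - 1      # modulus p = 2147483647


def next_seed(seed):
    """seed(t+1) = (a * seed(t)) modulo p."""
    return (A * seed) % P


def uniform_stream(seed, count):
    """Return `count` uniform draws and the final seed."""
    draws = []
    for _ in range(count):
        seed = next_seed(seed)
        draws.append(seed / (P + 1))  # rand = seed(t+1) / (p+1)
    return draws, seed
```

Python integers do not overflow, so the product `A * seed` needs no special handling; a fixed-width implementation would need 64-bit intermediate arithmetic.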


Mersenne Twister Random Number Generator 
The Mersenne Twister (MT) algorithm generates uniform 32-bit pseudorandom integers. The 
algorithm provides a period of $2^{19937} - 1$, assured 623-dimensional equal distribution, and 32-bit 
accuracy. Following the description given by Matsumoto and Nishimura (1998), the algorithm is 
based on the linear recurrence: 


$x_{k+n} = x_{k+m} \oplus (x_k^u \mid x_{k+1}^l) A, \quad k = 0, 1, \ldots$ 


where 
Table D-1 
Notation 

Notation        Description 
x               is a word vector; a w-dimensional row vector over the two-element field $F_2 = \{0,1\}$ 
n               is the degree of recurrence (recursion) 
r               is an integer, $0 \le r \le w-1$, the separation point of one word 
m               is an integer, $1 \le m \le n$, the middle term 
A               is a constant w x w matrix with entries in $F_2$ 
$x_k^u$         is the upper (w−r) bits of $x_k$ 
$x_{k+1}^l$     is the lower r bits of $x_{k+1}$; thus $(x_k^u \mid x_{k+1}^l)$ is the word vector obtained by 
                concatenating the upper (w−r) bits of $x_k$ and the lower r bits of $x_{k+1}$ 
$\oplus$        Bitwise addition modulo two (XOR) 

Given initial seeds $x_0, x_1, \ldots, x_{n-1}$, the algorithm generates $x_{k+n}$ by the above recurrence for 
k=0, 1, ... 


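A form of the matrix A is chosen so that multiplication by A is very fast. A candidate is 


$A = \begin{pmatrix} & 1 & & & \\ & & 1 & & \\ & & & \ddots & \\ & & & & 1 \\ a_{w-1} & a_{w-2} & \cdots & \cdots & a_0 \end{pmatrix}$ 

where $a = (a_{w-1}, a_{w-2}, \ldots, a_0)$ and $x = (x_{w-1}, x_{w-2}, \ldots, x_0)$; then xA can be computed using 
only bit operations 


$xA = \begin{cases} \text{shiftright}(x) & \text{if } x_0 = 0 \\ \text{shiftright}(x) \oplus a & \text{if } x_0 = 1 \end{cases}$ 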


Thus calculation of the recurrence is realized with bitshift, bitwise EXCLUSIVE-OR, bitwise 
OR, and bitwise AND operations. 


For improving the k-distribution to v-bit accuracy, we multiply each generated word by a suitable 
w x w invertible matrix T from the right (called tempering in Matsumoto and Kurita, 1994). For 
the tempering matrix z = xT, we choose the following successive transformations 

y := x ⊕ (x >> u) 
y := y ⊕ ((y << s) AND b) 
y := y ⊕ ((y << t) AND c) 
z := y ⊕ (y >> l) 

Table D-2 
Notation 

Notation      Description 
l, s, t, u    are integers 
b, c          are suitable bitmasks of word size 
x>>u          denotes the u-bit shiftright 
x<<u          denotes the u-bit shiftleft 


To execute the recurrence, let x[0:n−1] be an array of n unsigned integers of word size, i be an 
integer variable, and u, v, a be unsigned constant integers of word size. 


Step      Description 

Step 0    u ← 1···1 0···0 (w−r ones followed by r zeros); bitmask for upper (w−r) bits 
          v ← 0···0 1···1 (w−r zeros followed by r ones); bitmask for lower r bits 
          a ← $a_{w-1} a_{w-2} \cdots a_0$; the last row of matrix A 
Step 1    i ← 0 
          Initialize the state space vector array x[0], x[1], ..., x[n−1]. 
Step 2    y ← (x[i] AND u) OR (x[(i+1) mod n] AND v); computing $(x_i^u \mid x_{i+1}^l)$ 
Step 3    If the least significant bit of y equals zero, then 
          x[i] ← x[(i+m) mod n] ⊕ (y >> 1) ⊕ 0 
          If the least significant bit of y equals one, then 
          x[i] ← x[(i+m) mod n] ⊕ (y >> 1) ⊕ a 
Step 4    Calculate x[i]T: 
          y ← x[i] 
          y ← y ⊕ (y >> u) 
          y ← y ⊕ ((y << s) AND b) 
          y ← y ⊕ ((y << t) AND c) 
          y ← y ⊕ (y >> l) 
Step 5    i ← (i+1) mod n 
Step 6    Go to Step 2. 
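
Steps 0 through 6 can be sketched in a few lines. The numeric parameters below (w = 32, n = 624, m = 397, r = 31, the constant a, and the tempering shifts and masks) are the standard MT19937 values from the Matsumoto and Nishimura reference implementation; they are an assumption here, since the steps above leave them symbolic. The single-integer seeding routine is likewise the reference one, and the class name is hypothetical.

```python
W, N, M, R = 32, 624, 397, 31
A = 0x9908B0DF                       # last row of matrix A
U, S, B = 11, 7, 0x9D2C5680          # tempering shift u, shift s, mask b
T, C, L = 15, 0xEFC60000, 18         # tempering shift t, mask c, shift l
WORD = (1 << W) - 1
UPPER = (WORD << R) & WORD           # Step 0: mask for the upper w-r bits
LOWER = (1 << R) - 1                 # Step 0: mask for the lower r bits


def init_genrand(s):
    """Fill the state array from one 32-bit seed (reference seeding)."""
    x = [s & WORD]
    for i in range(1, N):
        prev = x[i - 1]
        x.append((1812433253 * (prev ^ (prev >> 30)) + i) & WORD)
    return x


class MersenneTwister:
    def __init__(self, seed):
        self.x = init_genrand(seed)  # Step 1
        self.i = 0

    def next_word(self):
        x, i = self.x, self.i
        y = (x[i] & UPPER) | (x[(i + 1) % N] & LOWER)           # Step 2
        x[i] = x[(i + M) % N] ^ (y >> 1) ^ (A if y & 1 else 0)  # Step 3
        y = x[i]                                                # Step 4
        y ^= y >> U
        y ^= (y << S) & B
        y ^= (y << T) & C
        y ^= y >> L
        self.i = (i + 1) % N                                    # Step 5
        return y & WORD
```

Updating one word per call, as in the steps above, yields the same sequence as the block-regeneration form used in most implementations.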


IBM SPSS Statistics Usage 


The MT algorithm provides 32 random bits in each draw. IBM® SPSS® Statistics draws 64-bit 
floating-point numbers in the range [0..1] with 53 random bits in the mantissa using 


$\text{Draw} = (2^{26} [k(t)/2^{5}] + [k(t+1)/2^{6}]) / 2^{53}$ 
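
A short sketch of this combination, assuming the square brackets denote the integer part so that each division becomes a right shift. The function name `draw53` is hypothetical.

```python
def draw53(k_t, k_t1):
    """Combine two 32-bit draws into a float with 53 random mantissa bits."""
    hi = k_t >> 5    # integer part of k(t)/2^5, 27 bits
    lo = k_t1 >> 6   # integer part of k(t+1)/2^6, 26 bits
    return (hi * 2**26 + lo) / 2**53
```

Under this reading the largest possible result is (2^53 − 1)/2^53, slightly below 1.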


There are two options for initializing the state space vector array. SET RNG=MT MTINDEX=x 
accepts a 64-bit floating point number x to set the seed. SET RNG=MT MTINDEX=RANDOM uses 
the current time of day in milliseconds since midnight to set the seed. 


init_genrand(unsigned32 s, unsigned32 &x[]) 
{ 
x[0] = s 
f = 1812433253; f is an unsigned long integer 
for i = 0 to n−2 
    x[i+1] = (f * (x[i] ⊕ (x[i] >> 30)) + i + 1) mod 2^32 
} 


k[0]: 8*d + 4*c + 2*b + a 

k[1]: y = trunc(z*2^26) 

k[2]: z*2^53 − y*2^27 


where 
■ x is the argument 
■ a is 1 if x == 0, or 0 otherwise 
■ b is 1 if x < 0, or 0 otherwise 
■ c is 1 if |x| >= 1, or 0 otherwise 
■ d is an integer such that 
  if |x| > 1, .5 <= |x|/2^d < 1, 
  else if |x| > 0, .5 <= |x|*2^d < 1 
  else x == 0 and d == 0. 
■ e is d if |x| <= 1, else −d 
■ z is |x|*2^e 

init_by_array(unsigned32 init_key[], int key_length, unsigned32 &x[]) 

init_genrand(19650218, x); 
i=1; j=0; 
k = max(key_length, n); 
for (; k; k--) 
    x[i] = (x[i] ⊕ ((x[i-1] ⊕ (x[i-1] >> 30)) * f1)) + init_key[j] + j; 
    i=i+1; j=j+1; 
    if i>=n then x[0] = x[n-1]; i=1; 
    if j>=key_length then j=0; 
end for 
for (k=n-1; k; k--) 
    x[i] = (x[i] ⊕ ((x[i-1] ⊕ (x[i-1] >> 30)) * f2)) − i; 
    i=i+1; 
    if i>=n then x[0] = x[n-1]; i=1; 
end for 


f1=1664525 is an unsigned long integer; f2=1566083941 is an unsigned long integer. 
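
The array seeding can be sketched as below, following the Matsumoto and Nishimura reference implementation. Two points are assumptions rather than statements from the text above: all arithmetic is reduced modulo 2^32, and the reference code's final assignment x[0] = 0x80000000 is included. The helper `first_output` applies one recurrence step and the standard MT19937 tempering to x[0], only so that the result can be checked.

```python
N, M = 624, 397
WORD = 0xFFFFFFFF


def init_genrand(s):
    x = [s & WORD]
    for i in range(1, N):
        prev = x[i - 1]
        x.append((1812433253 * (prev ^ (prev >> 30)) + i) & WORD)
    return x


def init_by_array(init_key):
    x = init_genrand(19650218)
    i, j = 1, 0
    for _ in range(max(N, len(init_key))):
        mixed = (x[i - 1] ^ (x[i - 1] >> 30)) * 1664525       # f1
        x[i] = ((x[i] ^ mixed) + init_key[j] + j) & WORD
        i += 1
        j += 1
        if i >= N:
            x[0] = x[N - 1]
            i = 1
        if j >= len(init_key):
            j = 0
    for _ in range(N - 1):
        mixed = (x[i - 1] ^ (x[i - 1] >> 30)) * 1566083941    # f2
        x[i] = ((x[i] ^ mixed) - i) & WORD
        i += 1
        if i >= N:
            x[0] = x[N - 1]
            i = 1
    x[0] = 0x80000000  # reference implementation only; assumed here
    return x


def first_output(x):
    """First tempered word produced from state x (standard MT19937 constants)."""
    y = (x[0] & 0x80000000) | (x[1] & 0x7FFFFFFF)
    w = x[M] ^ (y >> 1) ^ (0x9908B0DF if y & 1 else 0)
    w ^= w >> 11
    w ^= (w << 7) & 0x9D2C5680
    w ^= (w << 15) & 0xEFC60000
    w ^= w >> 18
    return w & WORD
```

The key {0x123, 0x234, 0x345, 0x456} is the test key distributed with the reference implementation.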
Grouped Percentiles 


Two summary functions, GMEDIAN and GPTILE, are used in procedures such as Frequencies 
and Graph to calculate percentiles for data that are grouped by specifying a value for 
each grouping. It is assumed that the actual data values given represent midpoints of the grouped 
intervals. 


Notation 


The following notation is used throughout this section unless otherwise stated: 


Table E-1 
Notation 

Notation              Description 
$x_1, \ldots, x_k$    Distinct observed values with frequencies (caseweights) $c_1, \ldots, c_k$ 
k                     Number of distinct observed data points 
p                     percentile/100 (a number between 0 and 1) 
$cc_i$                Cumulative frequency up to and including $x_i$: 
                      $cc_i = \sum_{l=1}^{i-1} c_l + 0.5\,c_i, \quad i = 1, \ldots, k$ 


Finding Percentiles 


To find the 100pth grouped percentile, first find i such that $cc_{i-1} \le wp < cc_i$, where $w = \sum_{j=1}^{k} c_j$ is 
the total sum of caseweights. Then the grouped percentile is 

$(1 - R)\,x_{i-1} + R\,x_i$ 

where 

$R = \dfrac{wp - cc_{i-1}}{cc_i - cc_{i-1}}$ 


Note the following: 

■ If $wp < cc_1$, the grouped percentile is system missing and a warning message “Since the lower 
bound of the first interval is unknown, some percentiles are undefined” is produced. 

■ If $wp > cc_k$, the grouped percentile is system missing and a warning message “Since the upper 
bound of the last interval is unknown, some percentiles are undefined” is produced. 

■ If $wp = cc_k$, the grouped percentile is equal to $x_k$. 
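
The interpolation rule and its three special cases can be sketched as follows. The function name is hypothetical, `None` stands in for system missing, and the cumulative frequencies are taken to the interval midpoints (previous weights plus half the current weight), which is how the notation table is read here.

```python
def grouped_percentile(values, weights, p):
    """Return the 100p-th grouped percentile, or None for system missing."""
    pairs = sorted(zip(values, weights))
    xs = [x for x, _ in pairs]
    cs = [c for _, c in pairs]
    w = sum(cs)
    cc, running = [], 0.0
    for c in cs:
        cc.append(running + 0.5 * c)  # cumulative frequency up to midpoint
        running += c
    wp = w * p
    if wp < cc[0] or wp > cc[-1]:
        return None                   # bound of first/last interval unknown
    if wp == cc[-1]:
        return xs[-1]
    for i in range(1, len(xs)):
        if cc[i - 1] <= wp < cc[i]:
            r = (wp - cc[i - 1]) / (cc[i] - cc[i - 1])
            return (1 - r) * xs[i - 1] + r * xs[i]
    return None
```

With values 1, 2, 3 each weighted 2, the cumulative frequencies are 1, 3, 5, so the grouped median (wp = 3) falls exactly on the middle value.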


Appendix 


Indicator Method 


The indicator method is used in the GENLOG and the GLM procedures to generate the design 
matrix corresponding to the design specified. Under this method, each parameter (either 
non-redundant or redundant) in the model corresponds uniquely to a column in the design matrix. 
Therefore, the terms parameter and design matrix column are often used interchangeably without 
ambiguity. 


Notation 

The following notation is used throughout this chapter unless otherwise stated: 

Table F-1 
Notation 

Notation    Description 
n           Number of valid observations 
p           Number of parameters 
X           n x p design matrix (also known as model matrix) 
$x_{ij}$    Elements of X 


Row Dimension 


The design matrix has as many rows as the number of valid observations. In the GLM procedure, 
an observation is a case in the data file. In the GENLOG procedure, an observation is a cell. 
In both procedures, the observations are uniquely identified by the factor-level combination. 
Therefore, rows of the design matrix are also uniquely identified by the factor-level combination. 


Column Dimension 


The design matrix has as many columns as the number of parameters in the model. Columns of 
the design matrix are uniquely indexed by the parameters, which are in turn related to factor-level 
combinations. 


Elements 


A factor-level combination is contained in another factor-level combination if the following 
conditions are true: 


■ All factor levels in the former combination appear in the latter combination. 


■ There are factor levels in the latter combination which do not appear in the former combination. 


For example, the combination [A=1] is contained in [A=1]*[B=3] and so is the combination 
[B=3]. However, neither [A=3] nor [C=1] is contained in [A=1]*[B=3]. 
The design matrix X is generated by rows. Elements of the ith row are generated as follows: 

■ If the jth column corresponds to the intercept term, then $x_{ij} = 1$. 


■ If the jth column is a parameter of a factorial effect that consists of factors only, 
then $x_{ij} = 1$ if the factor-level combination of the jth column is contained in that of the ith 
row. Otherwise $x_{ij} = 0$. 


■ If the jth column is a parameter of an effect involving covariates (or, in the GLM procedure, a 
product of covariates), then $x_{ij}$ is equal to the covariate value (or the product of the covariate 
values in GLM) of the ith row if the levels combination of the factors of the jth column is 
contained in that of the ith row. Otherwise $x_{ij} = 0$. 
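
The three rules can be illustrated with a small row generator. The data layout (dictionaries with `kind`, `levels`, and `covariates` keys) and the function names are hypothetical. The sketch also treats a column whose factor levels exactly equal the row's combination as contained, which is an assumption about how the rule is applied.

```python
def contained_in(column_levels, row_levels):
    """True if every factor level of the column appears in the row's combination."""
    return all(row_levels.get(f) == lev for f, lev in column_levels.items())


def design_row(row_levels, row_covariates, columns):
    """Build one row of the design matrix under the indicator method."""
    row = []
    for col in columns:
        if col["kind"] == "intercept":
            row.append(1.0)                    # intercept column
            continue
        if not contained_in(col["levels"], row_levels):
            row.append(0.0)                    # combination not contained
            continue
        value = 1.0
        for name in col.get("covariates", []):
            value *= row_covariates[name]      # covariate value or product
        row.append(value)
    return row


columns = [
    {"kind": "intercept"},
    {"kind": "effect", "levels": {"A": 1}},
    {"kind": "effect", "levels": {"A": 2}},
    {"kind": "effect", "levels": {"A": 1, "B": 3}},
    {"kind": "effect", "levels": {"A": 1}, "covariates": ["x"]},
]
example = design_row({"A": 1, "B": 3}, {"x": 2.5}, columns)
```

Here `example` has one entry per parameter: the intercept, [A=1], [A=2], [A=1]*[B=3], and the covariate x within [A=1].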


Redundancy 


A parameter is redundant if the corresponding column in the design matrix is linearly dependent 
on other columns. Linearly dependent columns are detected using the SWEEP algorithm by Clarke 
(1982) and Ridout and Cobby (1989). Redundant parameters are permanently set to zero and 
their standard errors are set to system missing. 
Appendix 


Post Hoc Tests 


Post hoc tests are available in more than one procedure, including ONEWAY and GLM. 


Notation 

The following notation is used throughout this section unless otherwise stated: 
Table G-1 
Notation 

Notation         Description 
k                Number of levels for an effect 
$n_i$            Number of observations at level i 
$\bar{x}_i$      Mean at level i 
$s_i$            Standard deviation of level i 
$v_i$            Degrees of freedom for level i, $n_i - 1$ 
$s_{pp}$         Square root of the mean square error 
$\varepsilon$    Experimentwise error rate under the complete null hypothesis 
$\alpha$         Comparisonwise error rate 
r                Number of steps between means 
f                Degrees of freedom for the within-groups mean square, $\sum_{i=1}^{k} (n_i - 1)$ 
$v_{i,j}$        Absolute difference between the ith and jth means, $|\bar{x}_i - \bar{x}_j|$ 
$k^*$            $k(k-1)/2$ 
$Q_{i,j}$        $s_{pp} \sqrt{\frac{1}{2}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$ 
$n_h$            Harmonic mean of the sample size, $n_h = k \left(\sum_{i=1}^{k} n_i^{-1}\right)^{-1}$ 
$Q_h$            $s_{pp}/\sqrt{n_h}$ 


Studentized Range and Studentized Maximum Modulus 


Let $x_1, x_2, \ldots, x_r$ be independent and identically distributed $N(\mu, \sigma^2)$. Let $s_m$ be an estimate 
of $\sigma$ with m degrees of freedom, which is independent of the $\{x_i\}$, and $m s_m^2/\sigma^2 \sim \chi^2_m$. Then 
the quantity 

$S_{r,m} = \dfrac{\max(x_1, \ldots, x_r) - \min(x_1, \ldots, x_r)}{s_m}$ 

is called the Studentized range. The upper-$\varepsilon$ critical point of this distribution is denoted by $S_{\varepsilon,r,m}$. 


The quantity 

$M_{r,m} = \dfrac{\max(|x_1|, \ldots, |x_r|)}{s_m}$ 

is called the Studentized maximum modulus. The upper-$\varepsilon$ critical point of this distribution 
is denoted as $M_{\varepsilon,r,m}$. 


Methods 


The tests are grouped as follows according to assumptions about sample sizes and variances. 


Equal Variances 


The tests in this section are based on the assumption that variances are equal. 


Waller-Duncan t Test 


The Waller-Duncan t test statistic is given by 

$v_{i,j} = |\bar{x}_i - \bar{x}_j| > t_B(w, F, q, f)\, S \sqrt{2/n}$ 

where $t_B(w, F, q, f)$ is the Bayesian t value that depends on w (a measure of the relative seriousness 
of a Type I error versus a Type II error), the F statistic for the one-way ANOVA, 

$F = \dfrac{MS_{treat}}{MS_{error}}$ 

and 

$S^2 = MS_{error}$ 

Here f = k(n−1) and q = k−1. $MS_{error}$ and $MS_{treat}$ are the usual mean squares in the 
ANOVA table. 


Only homogeneous subsets are given for the Waller-Duncan t test. This method is for equal 
sample sizes. For unequal sample sizes, the harmonic mean $n_h$ is used instead of n. 


Constructing Homogeneous Subsets 


For many tests assuming equal variances, homogeneous subsets are constructed using a range 
determined by the specific test being used. The following steps are used to construct the 
homogeneous subsets: 


1. Rank the k means in ascending order and denote the ordered means as $\bar{x}_{(1)}, \ldots, \bar{x}_{(k)}$. 

2. Determine the range value, $R_{\varepsilon,k,f}$, for the specific test, as shown in Range Values. 

3. If $\bar{x}_{(k)} - \bar{x}_{(1)} > Q_h R_{\varepsilon,k,f}$, there is a significant range and the ranges of the two sets of k−1 means 
$\{\bar{x}_{(1)}, \ldots, \bar{x}_{(k-1)}\}$ and $\{\bar{x}_{(2)}, \ldots, \bar{x}_{(k)}\}$ are compared with $Q_h R_{\varepsilon,k-1,f}$. Smaller subsets of means 
are examined as long as the previous subset has a significant range. For some tests, $Q_{i,j}$ is used 
instead of $Q_h$. For more information, see the topic “Range Values”. 

4. Each time a range proves nonsignificant, the means involved are included in a single group, a 
homogeneous subset. 
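
The stepwise examination of ranges can be sketched recursively. The function below is an illustration under stated assumptions: `range_value(r)` is a caller-supplied stand-in for the test-specific range value for r means, `q_h` is the common multiplier, and subsets contained in a larger nonsignificant subset are dropped at the end.

```python
def homogeneous_subsets(means, q_h, range_value):
    """Return the maximal groups of ordered means with nonsignificant ranges."""
    xs = sorted(means)                      # step 1: rank the means
    found = set()

    def examine(lo, hi):
        r = hi - lo + 1                     # number of means in this subset
        if r == 1 or xs[hi] - xs[lo] <= q_h * range_value(r):
            found.add((lo, hi))             # step 4: nonsignificant range
            return
        examine(lo, hi - 1)                 # step 3: the two subsets of r-1 means
        examine(lo + 1, hi)

    examine(0, len(xs) - 1)
    maximal = [
        s for s in found
        if not any(t != s and t[0] <= s[0] and s[1] <= t[1] for t in found)
    ]
    return [xs[lo:hi + 1] for lo, hi in sorted(maximal)]
```

For four means 10, 11, 20, 21 with a constant critical range of 3, the full range and both three-mean ranges are significant, leaving the pairs {10, 11} and {20, 21}.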


Range Values 


Following are range values for the various types of tests. 


Student-Newman-Keuls (SNK) 

$R_{\varepsilon,r,f} = S_{\varepsilon,r,f}$ 


Tukey’s Honestly Significant Difference Test (TUKEY) 

$R_{\varepsilon,r,f} = S_{\varepsilon,k,f}$ 

The confidence intervals of the mean difference are calculated using $Q_{i,j}$ instead of $Q_h$. 


Tukey’s b (TUKEYB) 

$R_{\varepsilon,r,f} = \dfrac{S_{\varepsilon,r,f} + S_{\varepsilon,k,f}}{2}$ 


Duncan’s Multiple Range Test (DUNCAN) 

$R_{\varepsilon,r,f} = S_{\alpha_r,r,f}$ where $\alpha_r = 1 - (1-\varepsilon)^{r-1}$ 


Scheffé Test (SCHEFFE) 

$R_{\varepsilon,r,f} = \sqrt{2(k-1)\,F_{1-\varepsilon}(k-1, f)}$ 

The confidence intervals of the mean difference are calculated using $Q_{i,j}$ instead of $Q_h$. 


Hochberg’s GT2 (GT2) 

$R_{\varepsilon,r,f} = \sqrt{2}\, M_{\varepsilon,k^*,f}$ 

The confidence intervals of the mean difference are calculated using $Q_{i,j}$ instead of $Q_h$. 


Gabriel’s Pairwise Comparisons Test (GABRIEL) 

The test statistic and the critical point are as follows: 

$|\bar{x}_i - \bar{x}_j| \ge s_{pp} \left( \dfrac{1}{\sqrt{2 n_i}} + \dfrac{1}{\sqrt{2 n_j}} \right) M_{\varepsilon,k,f}$ 

For homogeneous subsets, $n_h$ is used instead of $n_i$ and $n_j$. The confidence intervals of the mean 
difference are calculated based on the above equation. 


Least Significant Difference (LSD), Bonferroni, and Sidak 


For the least significant difference, Bonferroni, and Sidak tests, only pairwise confidence intervals 
are given. The test statistic is 

$|\bar{x}_i - \bar{x}_j| > Q_{i,j}\, R_{\varepsilon,r,f}$ 

where the range, $R_{\varepsilon,r,f}$, for each test is provided below. 


Least Significant Difference (LSD) 

$R_{\varepsilon,r,f} = \sqrt{2 F_{1-\alpha}(1, f)}$ 


Bonferroni t Test (BONFERRONI or MODLSD) 

$R_{\varepsilon,r,f} = \sqrt{2 F_{1-\alpha}(1, f)}$ 

where $\alpha = \varepsilon/k^*$. 


Sidak t Test (SIDAK) 

$R_{\varepsilon,r,f} = \sqrt{2 F_{1-\alpha}(1, f)}$ 

where $\alpha = 1 - (1-\varepsilon)^{2/(k(k-1))}$. 
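
The Bonferroni and Sidak comparisonwise levels differ only in how the experimentwise rate is split over the k(k−1)/2 pairs. A brief sketch, with hypothetical function names:

```python
def bonferroni_alpha(eps, k):
    """Comparisonwise level: eps divided by the number of pairs k* = k(k-1)/2."""
    return eps / (k * (k - 1) / 2)


def sidak_alpha(eps, k):
    """Comparisonwise level: 1 - (1 - eps)^(2 / (k(k-1)))."""
    return 1 - (1 - eps) ** (2 / (k * (k - 1)))
```

For the same experimentwise rate the Sidak level is never smaller than the Bonferroni level, so the Sidak range is slightly less conservative.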


Dunnett Tests 


For the Dunnett tests, confidence intervals are given only for the difference between the control 
group and the other groups. 


Dunnett’s Two-Tailed t Test (DUNNETT) 


When a set of new treatments ($\bar{x}_i$) is compared with a control ($\bar{x}_0$), Dunnett’s two-tailed t test is 
usually used under the equal variances assumption. 


For two-tailed tests, 

$v_{i,0} = |\bar{x}_i - \bar{x}_0| > d^{\varepsilon}_{k,v}\, s_{dd} \sqrt{\dfrac{1}{n_i} + \dfrac{1}{n_0}}$ 

where $d^{\varepsilon}_{k,v}$ is the upper 100$\varepsilon$ percentage point of the distribution of 

$T = \max_{1 \le i \le k} \{|T_i|\}$ 

where $T_i = \dfrac{\bar{x}_i - \bar{x}_0}{s_{dd}\sqrt{1/n_i + 1/n_0}}$ and $s_{dd}^2 = \dfrac{\sum_{i=0}^{k} \sum_{j=1}^{n_i} (Y_{ij} - \bar{x}_i)^2}{\sum_{i=0}^{k} (n_i - 1)}$ 


Dunnett’s One-Tailed t Test (DUNNETTL) 


This Dunnett’s one-tailed t test indicates whether the mean at any level is smaller than a reference 
category. 

$\bar{x}_i - \bar{x}_0 > d^{U\varepsilon}_{k,v}\, s_{dd} \sqrt{\dfrac{1}{n_i} + \dfrac{1}{n_0}}$ 

where $d^{U\varepsilon}_{k,v}$ is the upper 100$\varepsilon$ percentage point of the distribution of 

$T = \max_{1 \le i \le k} \{T_i\}$ 

Confidence intervals are given only for the difference between the control group and the other 
groups. 


Dunnett’s One-Tailed t Test (DUNNETTR) 


This Dunnett’s one-tailed t test indicates whether the mean at any level is larger than a reference 
category. 

$\bar{x}_i - \bar{x}_0 < d^{L\varepsilon}_{k,v}\, s_{dd} \sqrt{\dfrac{1}{n_i} + \dfrac{1}{n_0}}$ 

where $d^{L\varepsilon}_{k,v}$ is the upper 100$\varepsilon$ percentage point of the distribution of 

$T = \max_{1 \le i \le k} \{T_i\}$ 

Confidence intervals are given only for the difference between the control group and the other 
groups. 


Ryan-Einot-Gabriel-Welsch (R-E-G-W) Multiple Stepdown Procedures 


For the R-E-G-W F test and the R-E-G-W Q test, a new significance level, $\gamma_r$, based on the number 
of steps between means is introduced: 

$\gamma_r = \begin{cases} 1 - (1-\varepsilon)^{r/k} & \text{if } r < k-1 \\ \varepsilon & \text{if } r \ge k-1 \end{cases}$ 

Note: For homogeneous subsets, $n_i$ and $n_j$ are used for the R-E-G-W F test and the R-E-G-W 
Q test. To apply these methods, the procedure is the same as in “Constructing Homogeneous 
Subsets”, using the tests provided below. 


Ryan-Einot-Gabriel-Welsch Based on the Studentized Range Test (QREGW) 


The R-E-G-W Q test is based on 

$\max_{i,j \in R} \left\{ (\bar{x}_i - \bar{x}_j) \sqrt{\min(n_i, n_j)} \right\} / s_{pp} \ge S_{\gamma_r,r,f}$ 

Ryan-Einot-Gabriel-Welsch Procedure Based on an F Test (FREGW) 


The R-E-G-W F test is based on 


where r = j − i + 1 and summations are over R = {i, ..., j}. 


Unequal Sample Sizes and Unequal Variances 


The tests in this section are based on assumptions that variances are unequal and sample sizes are 
unequal. An estimate of the degrees of freedom is used. The estimator is 

$v = \dfrac{(s_i^2/n_i + s_j^2/n_j)^2}{s_i^4/(n_i^2 v_i) + s_j^4/(n_j^2 v_j)}$ 
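
As a quick numerical illustration of this degrees-of-freedom estimator (with $v_i = n_i - 1$ as in the notation table), consider the following sketch. The function name is hypothetical.

```python
def estimated_df(s_i, n_i, s_j, n_j):
    """Approximate degrees of freedom for comparing two levels with unequal variances."""
    a = s_i**2 / n_i
    b = s_j**2 / n_j
    # s^4 / (n^2 v) equals (s^2 / n)^2 / v with v = n - 1
    return (a + b) ** 2 / (a**2 / (n_i - 1) + b**2 / (n_j - 1))
```

When both levels have the same standard deviation and the same size n, the estimate reduces to 2(n − 1); it shrinks toward the smaller $v_i$ as the variances diverge.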


Two means are significantly different if 


and $R_{\varepsilon,r,v}$ depends on the specific test being used, as listed below. 


For the Games-Howell, Tamhane’s T2, Dunnett’s T3, and Dunnett’s C tests, only pairwise 
confidence intervals are given. 


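Games-Howell Pairwise Comparison Test (GH) 

$R_{\varepsilon,r,v} = S_{\varepsilon,k,v} / \sqrt{2}$ 


Tamhane’s T2 (T2) 

$R_{\varepsilon,r,v} = \sqrt{F_{1-\gamma}(1, v)} = t_{\gamma,v}$ where $\gamma = 1 - (1-\varepsilon)^{1/k^*}$ 


Dunnett’s T3 (T3) 

$R_{\varepsilon,r,v} = M_{\varepsilon,k^*,v}$ 


Dunnett’s C (C) 

$R_{\varepsilon,r,v} = \dfrac{\left(S_{\varepsilon,k,n_i-1}\, s_i^2/n_i + S_{\varepsilon,k,n_j-1}\, s_j^2/n_j\right)/\sqrt{2}}{s_i^2/n_i + s_j^2/n_j}$ 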
Appendix 


Sums of Squares 


This appendix describes methods for computing sums of squares. 


Notation 


The notation used in this appendix is the same as that in the GLM Univariate and Multivariate 
chapter. 


Type I Sum of Squares and Hypothesis Matrix 


The Type I sum of squares is computed by fitting the model in steps according to the order of 
the effects specified in the design and recording the difference in error sum of squares (ESS) at 
each step. 

By applying the SWEEP operator on the rows and columns of the augmented matrix Z′WZ, of 
dimension (p + r) x (p + r), the Type I sum of squares and its hypothesis matrix for each effect 
(except for the intercept effect, if any) is obtained. 


Calculating the Sum of Squares 


The following procedure is used to find the Type I sum of squares for effect F: 

Let the order of effects specified in the design be $F_0, F_1, F_2, \ldots, F_m$. The columns of X are 
partitioned into $X_0, X_1, X_2, \ldots, X_m$, where $X_0 = 1$ corresponds to the intercept effect $F_0$, and the 
columns in the submatrix $X_j$ correspond to effect $F_j$, j = 0, 1, ..., m. 

Let $F_j$ be the effect F of interest. Let $ESS_{j-1}(l)$ and $ESS_j(l)$ be the lth diagonal elements of the r x r 
lower diagonal submatrix of Z′WZ after the SWEEP operator is applied to the columns associated 
with $X_0, X_1, \ldots, X_{j-1}$ and $X_0, X_1, \ldots, X_j$, respectively. When the lth column of Y is used as the 
dependent variable, the Type I sum of squares for effect $F_j$ is $ESS_{j-1}(l) - ESS_j(l)$, l = 1, ..., r, 
where $ESS_{-1}(l)$ is defined as 0. 


Constructing a Hypothesis Matrix 

The hypothesis matrix L is constructed using the following steps: 


1. Let $L_0$ be the upper diagonal p x p submatrix of Z′WZ after the SWEEP operator is applied to 
the columns associated with the effects preceding F. Set the columns and rows of $L_0$, which are 
associated with the effects preceding F, to 0. 


2. For the rows of $L_0$ associated with the effects ordered after F, if any, set the corresponding rows 
of $L_0$ to 0. Remove all of the 0 rows in the matrix $L_0$. The row dimension of $L_0$ is then less than p. 


3. Use row operations on the rows of $L_0$ to remove any linearly dependent rows. The set of all 
nonzero rows of $L_0$ forms a Type I hypothesis matrix L. 


Type II Sum of Squares and Hypothesis Matrix 

A Type II sum of squares is the reduction in ESS due to adding an effect after all other terms have 
been added to the model except effects that contain the effect being tested. 

For any two effects F and F′, F is contained in F′ if the following conditions are true: 


■ Both effects F and F′ involve the same covariate, if any. 

■ F′ consists of more factors than F. 

■ All factors in F also appear in F′. 


Intercept Effect. The intercept effect 1, is contained in all the pure factor effects. However, it is not 
contained in any effect involving a covariate. No other effect is contained in the intercept effect. 


Calculating the Sum of Squares 
To find the Type II (and also Type III and IV) sum of squares associated with any effect F, you 
must distinguish which effects in the model contain F and which do not. The columns of X can 
then be partitioned into three groups: $X_1$, $X_2$, and $X_3$, where: 

■ $X_1$ consists of columns of X that are associated with effects that do not contain F. 
■ $X_2$ consists of columns that are associated with F. 
■ $X_3$ consists of columns that are associated with effects that contain F. 

The SWEEP operator applied on the augmented matrix Z′WZ is used to find the Type II sum 
of squares for each effect. The order of sweeping is determined by the “contained” relationship 
between the effect being tested and all other effects specified in the design. 


Once the ordering is defined, the Type II sum of squares and its hypothesis matrix L can be 
obtained by a procedure similar to that used for the Type I sum of squares. 


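Constructing a Hypothesis Matrix 

A hypothesis matrix L for the effect F has the form 

$L = \begin{pmatrix} 0 & C X_2' W^{1/2} M_1 W^{1/2} X_2 & C X_2' W^{1/2} M_1 W^{1/2} X_3 \end{pmatrix}$ 

where 

$M_1 = I - W^{1/2} X_1 (X_1' W X_1)^{*} X_1' W^{1/2}$ 

$C = (X_2' W^{1/2} M_1 W^{1/2} X_2)^{*}$ 

$A^{*}$ is a g2 generalized inverse of a symmetric matrix A. 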


Type III Sum of Squares and Hypothesis Matrix 


The Type III sum of squares for an effect F can best be described as the sum of squares for F 
adjusted for effects that do not contain it, and orthogonal to effects (if any) that contain it. 


Constructing a Hypothesis Matrix 


A Type III hypothesis matrix L for an effect F is constructed using the following steps: 


1. The design matrix X is reordered such that the columns can be grouped in three parts as described 
in the Type II approach. Compute $H = (X'WX)^{-} X'WX$. Notice that the columns of H can also 
be partitioned into three parts: the columns corresponding to effects not containing F, the columns 
corresponding to the effect F, and the columns corresponding to the effects containing F (if any). 


2. The columns of those effects not containing F (except F) are set to 0 by means of the row 
operation. That is: 


a) For each of those columns that is not already 0, fix any nonzero element in that column and 
call this nonzero element the pivot element. 


b) Divide the row that corresponds to the pivot element by the value of the pivot element itself. 


c) Use row operations to introduce zeros to all other nonzero elements (except the pivot element 
itself) in that column. 


d) Set the whole row containing the pivot element to 0. The column and the row corresponding to 
this pivot element should now be 0. 


e) Continue the process for the next column that is not 0 until all columns corresponding to those 
effects that do not contain F are 0. 


3. For each column associated with effect F, find a nonzero element, use it as pivot, and perform 
the Gaussian elimination method as described in a, b, and c of step 2. After all such columns are 
processed, remove all of the 0 rows from the resulting matrix. If there is no column corresponding 
to effects containing F (which is the case when F contains all other effects), the matrix just 
constructed is the Type III hypothesis matrix for effect F. If there are columns corresponding to 
effects that contain F, continue with step 4. 


4. The rows of the resulting matrix in step 3 can now be categorized into two groups. In one 
group, the columns corresponding to the effect F are all 0; call this group of rows $G_0$. In the 
other group, those columns are nonzero; call this group of rows $G_1$. Notice that the rows in $G_0$ 
form a generating basis for the effects that contain F. Transform the rows in $G_1$ such that they 
are orthogonal to any rows in $G_0$. 


Calculating the Sum of Squares 


Once a hypothesis matrix is constructed, the corresponding sum of squares can be calculated by 
$(L\hat{B})' (LGL')^{-1} L\hat{B}$. 



Type IV Sum of Squares and Hypothesis Matrix 


A hypothesis matrix L of a Type IV sum of squares for an effect F is constructed such that for 
each row of L, the columns corresponding to effect F are distributed equitably across the columns 
of effects containing F. Such a distribution is affected by the availability and the pattern of the 
nonmissing cells. 


Constructing a Hypothesis Matrix 


A Type IV hypothesis matrix L for effect F is constructed using the following steps: 
1. Perform steps 1, 2, and 3 as described for the Type III sum of squares. 


2. If there are no columns corresponding to the effects containing F, the resulting matrix is a Type 
IV hypothesis matrix for effect F. If there are columns corresponding to the effects containing F, 
the following step is needed. 


3. First, notice that each column corresponding to effect F represents a level in F. Moreover, the 
values in these columns can be viewed as the coefficients of a contrast for comparing different 
levels in F. For each row, the values of the columns corresponding to the effects that contain F are 
based on the values in that contrast. The final hypothesis matrix L consists of rows with nonzero 
columns corresponding to effect F. For each row with nonzero columns corresponding to effect F: 


a) If the value of any column (or level) corresponding to effect F is 0, set to 0 all values of columns 
corresponding to effects containing F and involving that level of F. 


b) For columns (or levels) of F that have nonzero values, count the number of times that those 
levels occur in one or more common levels of the other effects. This count will be based on 
the availability of the nonmissing cells in the data. Then set each column corresponding to an 
effect that contains F and involves that level of F to the value of the column that corresponds to 
that level of F divided by the count. 


c) If any value of the column corresponding to an effect that contains F and involves a level 
(column) of F is undetermined, while the value for that level (column) of F is nonzero, set the 
value to 0 and claim that the hypothesis matrix created may not be unique. 


Calculating the Sum of Squares 


Once a hypothesis matrix is constructed, the corresponding sum of squares can be calculated by 
$(L\hat{B})' (LGL')^{-1} L\hat{B}$. The corresponding degrees of freedom for this test is the row rank of 
the hypothesis matrix. 


Appendix 


Distribution and Special Functions 


The functions described in this appendix are used in more than one procedure. They are grouped 
into the following categories: 


■ Continuous Distributions. Beta, Cauchy, chi-square, exponential, F, gamma, Laplace, 
logistic, lognormal, normal, noncentral beta, noncentral chi-square, noncentral F, noncentral 
Student’s t, Pareto, Student’s t, uniform, and Weibull 


■ Discrete Distributions. Bernoulli, binomial, geometric, hypergeometric, negative 
binomial, and Poisson 


■ Special Functions. Gamma function, beta function, incomplete gamma function 
(ratio), incomplete beta function (ratio), and standard normal function 


Notation 

The following notation is used throughout this chapter unless otherwise stated: 

Table I-1 
Notation 

Notation       Description 
f(x)           Density function of continuous random variable X or probability mass 
               function of discrete random variable X 
F(x)           Cumulative distribution function of continuous or discrete variable X 
$F^{-1}(p)$    Inverse cumulative distribution function of X 


Continuous Distributions 


These are functions of a single scale variable. 


Beta 


The beta distribution takes values in the range 0<x<1 and has two shape parameters, $\alpha$ and $\beta$. Both 
$\alpha$ and $\beta$ must be positive, and they have the property that the mean of the distribution is $\alpha/(\alpha+\beta)$. 


Common uses. The beta distribution is used in Bayesian analyses as a conjugate to the binomial 
distribution. 


Functions. The CDF, IDF, PDF, NCDF, NPDF, and RV functions are available. 


The beta distribution has PDF, CDF, and IDF 

$f(x; \alpha, \beta) = \dfrac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}$ 

$F(x; \alpha, \beta) = IB(x; \alpha, \beta)$ 

$F^{-1}(p; \alpha, \beta) = IB^{-1}(p; \alpha, \beta)$ 

where $B(a, b) = \int_0^1 x^{a-1}(1-x)^{b-1}\,dx$ is the beta function and 

$IB(x; a, b) = \dfrac{1}{B(a, b)} \int_0^x t^{a-1}(1-t)^{b-1}\,dt$ is the incomplete beta function. 

Relationship to other distributions. 


■ When $\alpha = \beta = 1$, the beta($\alpha$,$\beta$) distribution is equivalent to the uniform(0,1) distribution. 


■ The beta($\alpha$,$\beta$) distribution is the distribution of X/(X+Y) where X and Y are variables that have 
chi-square distributions with degrees of freedom parameters $2\alpha$ and $2\beta$, respectively. 


Random Number Generation 

Special case (a=1 or b=1) 

1. Generate U from Uniform(0,1). 
2. If a=1, set $X = 1 - U^{1/b}$. 
3. If b=1, set $X = U^{1/a}$. 
4. If both a=1 and b=1, set X=U. 

Algorithm BN due to Ahrens and Dieter (1974) for a > 1 and b > 1 

1. Set e = a − 1, f = b − 1, c = e + f, g = c ln(c), u = e/c, and $s = 0.5/\sqrt{c}$. 
2. Generate Y from N(0,1) and set X = sY + u. 
3. If X < 0 or X > 1, go to step 2. 
4. Generate U from Uniform(0,1). 
5. If $\ln(U) < e \ln(X/e) + f \ln((1-X)/f) + g + 0.5Y^2$, finish; otherwise go to step 2. 
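
A minimal sketch of Algorithm BN as listed above, using Python's standard `random` module. The function name is hypothetical, and two small guards are added that are not part of the listed steps: endpoints X = 0 and X = 1 are rejected, and U is drawn from (0, 1], both to keep the logarithms finite.

```python
import math
import random


def beta_bn(a, b, rng=random):
    """Draw one beta(a, b) variate by normal-envelope rejection; needs a > 1, b > 1."""
    if a <= 1 or b <= 1:
        raise ValueError("Algorithm BN requires a > 1 and b > 1")
    e, f = a - 1.0, b - 1.0                 # step 1
    c = e + f
    g = c * math.log(c)
    u = e / c
    s = 0.5 / math.sqrt(c)
    while True:
        y = rng.gauss(0.0, 1.0)             # step 2
        x = s * y + u
        if x <= 0.0 or x >= 1.0:            # step 3 (endpoints also rejected)
            continue
        unif = 1.0 - rng.random()           # step 4, value in (0, 1]
        bound = e * math.log(x / e) + f * math.log((1 - x) / f) + g + 0.5 * y * y
        if math.log(unif) < bound:          # step 5
            return x
```

The normal proposal is centered at the mode e/c of the beta density, which is why acceptance is frequent for moderately large a and b.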

References. CDF: AS 63 (1973); ICDF: AS 64 (1973) and AS 109 (1977); RV: AS 134 (1979), 
Ahrens and Dieter (1974), and Cheng (1978). (See the Algorithm Index and References.) 


Bivariate Normal 


The bivariate normal distribution takes real values and has one correlation parameter, $\rho$, which 
must be between −1 and 1, inclusive. 


Functions. The CDF and PDF functions are available and require two quantiles, x1 and x2. 


The bivariate normal distribution has PDF 

$f(x_1, x_2; \rho) = \dfrac{1}{2\pi\sqrt{1-\rho^2}} \exp\left( -\dfrac{x_1^2 - 2\rho x_1 x_2 + x_2^2}{2(1-\rho^2)} \right)$ 


The CDF does not have a closed form and is computed by approximation. 
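
The density is straightforward to evaluate directly. In this sketch the function name is hypothetical, and the endpoints $\rho = \pm 1$ are rejected because the expression divides by $1 - \rho^2$.

```python
import math


def bvn_pdf(x1, x2, rho):
    """Standard bivariate normal density with correlation rho, |rho| < 1."""
    if not -1.0 < rho < 1.0:
        raise ValueError("density is undefined for |rho| = 1")
    one_minus = 1.0 - rho * rho
    q = (x1 * x1 - 2.0 * rho * x1 * x2 + x2 * x2) / (2.0 * one_minus)
    return math.exp(-q) / (2.0 * math.pi * math.sqrt(one_minus))
```

With rho = 0 the value factors into the product of two standard normal densities.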


Relationship to other distributions. 


■ Two variables with a bivariate normal($\rho$) distribution with correlation $\rho$ have marginal normal 
distributions with a mean of 0 and a standard deviation of 1. 


Numerical algorithms for computing the CDF are described in the references. 


References. AS 462 (1973) and AS 195. (See the Algorithm Index and References.) 


Cauchy 


The Cauchy distribution takes real values and has a location parameter, 0, and a scale parameter, 
¢; ¢ must be positive. The Cauchy distribution is symmetric about the location parameter, but 
has such slowly decaying tails that the distribution does not have a computable mean. 


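Functions. The CDF, IDF, PDF, and RV functions are available.

The Cauchy distribution has PDF, CDF, and IDF

$$f(x;\theta,\varsigma) = \frac{1}{\pi\varsigma\left(1 + \left(\frac{x-\theta}{\varsigma}\right)^2\right)}$$

$$F(x;\theta,\varsigma) = \frac{1}{2} + \frac{1}{\pi}\tan^{-1}\left(\frac{x-\theta}{\varsigma}\right)$$

$$F^{-1}(p;\theta,\varsigma) = \theta + \varsigma\tan(\pi(p - 1/2))$$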


Relationship to other distributions. 


■ A “standardized” Cauchy variate, (x−θ)/ς, has a t distribution with 1 degree of freedom.
Random Number Generation 


Inverse CDF algorithm 
1. Generate U from Uniform(0,1). 


2. Set X = a + b tan(π(U − 1/2)).
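The inverse CDF algorithm translates almost directly into code. The sketch below is illustrative only; the name `cauchy_variate` and the `rng` argument are assumptions for this example, with a as the location and b as the scale.

```python
import math
import random


def cauchy_variate(a, b, rng=random):
    """Draw one Cauchy(a, b) variate by inverting the CDF (b > 0)."""
    if b <= 0.0:
        raise ValueError("scale b must be positive")
    # Step 1: uniform draw on [0, 1).
    u = rng.random()
    # Step 2: apply the inverse CDF a + b*tan(pi*(U - 1/2)).
    return a + b * math.tan(math.pi * (u - 0.5))
```

Because the distribution has no mean, a sanity check on simulated values is better based on quantiles: the sample median should be close to a, and the upper quartile close to a + b.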


Chi-Square 


The chi-square(ν) distribution takes values in the range x>=0 and has one degrees of freedom
parameter, ν; it must be positive and has the property that the mean of the distribution is ν.


Functions. The CDF, IDF, PDF, RV, NCDF, NPDF, and SIG functions are available. 
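The chi-square distribution has PDF, CDF, and IDF

$$f(x;\nu) = \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)}\, x^{\nu/2-1} e^{-x/2}$$

$$F(x;\nu) = IG\left(\frac{x}{2}; \frac{\nu}{2}\right)$$

$$F^{-1}(p;\nu) = 2\,IG^{-1}\left(p; \frac{\nu}{2}\right)$$

where $\Gamma(a) = \int_0^\infty x^{a-1} e^{-x}\,dx$ is the gamma function and $IG(x;a) = \frac{1}{\Gamma(a)}\int_0^x t^{a-1} e^{-t}\,dt$ is the
incomplete gamma function.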


Relationship to other distributions. 


■ The chi-square(ν) distribution is the distribution of the sum of squares of ν independent
normal(0,1) random variates.


■ The chi-square(ν) distribution is equivalent to the gamma(ν/2, 1/2) distribution.
Random Number Generation
Generate X from the Gamma(a/2, 1/2) distribution.
References. CDF: CACM 299 (1967); ICDF: AS 91 (1975), AS R85 (1991), and CACM 451
(1973). (See the Algorithm Index and References.)
Exponential 


The exponential distribution takes values in the range x>=0 and has one scale parameter, β, which
must be greater than 0 and has the property that the mean of the distribution is 1/β.


Common uses. In life testing, the scale parameter β represents the rate of decay.
Functions. The CDF, IDF, PDF, and RV functions are available. 
The exponential distribution has PDF, CDF, and IDF

$$f(x;\beta) = \beta e^{-\beta x}$$

$$F(x;\beta) = 1 - e^{-\beta x}$$

$$F^{-1}(p;\beta) = -\ln(1-p)/\beta$$


Relationship to other distributions. 


■ The exponential(β) distribution is equivalent to the gamma(1,β) distribution.
Random Number Generation 
Inverse CDF algorithm 


Generate U from Uniform(0,1); X = −ln(1−U)/a.


F


The F distribution takes values in the range x>=0 and has two degrees of freedom parameters, ν1
and ν2, which are the “numerator” and “denominator” degrees of freedom, respectively. Both
ν1 and ν2 must be positive.




Common uses. The F distribution is commonly used to test hypotheses under the Gaussian 
assumption. 


Functions. The CDF, IDF, PDF, RV, NCDF, NPDF, and SIG functions are available.


The F distribution has PDF, CDF, and IDF 


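$$f(x;\nu_1,\nu_2) = \frac{(\nu_1/\nu_2)^{\nu_1/2}}{B(\nu_1/2,\nu_2/2)}\, x^{\nu_1/2-1} \left(1 + \frac{\nu_1}{\nu_2}x\right)^{-(\nu_1+\nu_2)/2}$$

$$F(x;\nu_1,\nu_2) = IB\left(\frac{\nu_1 x}{\nu_1 x + \nu_2}; \frac{\nu_1}{2}, \frac{\nu_2}{2}\right)$$

$$F^{-1}(p;\nu_1,\nu_2) = \frac{\nu_2\, IB^{-1}(p;\nu_1/2,\nu_2/2)}{\nu_1\left(1 - IB^{-1}(p;\nu_1/2,\nu_2/2)\right)}$$

where $B(a,b) = \int_0^1 x^{a-1}(1-x)^{b-1}\,dx$ is the beta function and
$IB(x;a,b) = \frac{1}{B(a,b)}\int_0^x t^{a-1}(1-t)^{b-1}\,dt$ is the incomplete beta function.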


Relationship to other distributions.


■ The F(ν1,ν2) distribution is the distribution of (X/ν1)/(Y/ν2), where X and Y are independent
chi-square random variates with ν1 and ν2 degrees of freedom, respectively.


Random Number Generation 


Using the chi-square distribution 
1. Generate Y and Z independently from chi-square(a) and chi-square(b), respectively. 
2. Set X=(Y/a)/ (Z/b). 
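A short Python sketch of this construction is given below. It is only an illustration: the helper names `chi_square_variate` and `f_variate` are hypothetical, and the chi-square draws come from the standard library's `gammavariate`, which is parameterized by shape and scale, so chi-square(ν) is drawn as a gamma with shape ν/2 and scale 2.

```python
import random


def chi_square_variate(df, rng=random):
    """Chi-square(df) drawn as a gamma variate with shape df/2 and scale 2."""
    return rng.gammavariate(df / 2.0, 2.0)


def f_variate(a, b, rng=random):
    """Draw one F(a, b) variate as a ratio of scaled chi-square variates."""
    # Step 1: independent chi-square draws for numerator and denominator.
    y = chi_square_variate(a, rng)
    z = chi_square_variate(b, rng)
    # Step 2: divide each by its degrees of freedom and take the ratio.
    return (y / a) / (z / b)
```

For b>2 the mean of the F(a,b) distribution is b/(b−2), which gives a quick check on simulated values.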
References. CDF: CACM 332 (1968). ICDF: use inverse of incomplete beta function. (See the 
Algorithm Index and References.) 
Gamma 


The gamma distribution takes values in the range x>=0 and has one shape parameter, α, and one
scale parameter, β. Both parameters must be positive and have the property that the mean of
the distribution is α/β.


Common uses. The gamma distribution is commonly used in queuing theory, inventory control, 
and precipitation processes. 


Functions. The CDF, IDF, PDF, and RV functions are available. 
The gamma distribution has PDF, CDF, and IDF 
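$$f(x;\alpha,\beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\beta x}$$

$$F(x;\alpha,\beta) = IG(\beta x;\alpha)$$

$$F^{-1}(p;\alpha,\beta) = \frac{1}{\beta}\, IG^{-1}(p;\alpha)$$

where $\Gamma(a) = \int_0^\infty x^{a-1} e^{-x}\,dx$ is the gamma function and $IG(x;a) = \frac{1}{\Gamma(a)}\int_0^x t^{a-1} e^{-t}\,dt$ is the
incomplete gamma function.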


Relationship to other distributions. 
■ When α=1, the gamma(α,β) distribution reduces to the exponential(β) distribution.
■ When β=1/2, the gamma(α,β) distribution reduces to the chi-square(2α) distribution.
■ When α is an integer, the gamma distribution is also known as the Erlang distribution.
Random Number Generation 

Special case 

If a= 1 and b > 0, generate X from an exponential distribution with parameter b. 


Algorithm GS due to Ahrens and Dieter (1974) for 0<a<1 and b=1
1. Generate U from Uniform(0,1). Set c=(e+a)/e, where e=exp(1).
2. Set P=cU. If P>1, go to step 4.
3. (P≤1) Set X = P^(1/a). Generate V from Uniform(0,1). If V>exp(−X), go to step 1; otherwise finish.
4. (P>1) Set X = −ln((c−P)/a). If X<0, go to step 1; otherwise go to step 5.
5. Generate V from Uniform(0,1). If V > X^(a−1), go to step 1; otherwise finish.


Algorithm due to Fishman (1976) for a>1 and b=1 
1. Generate Y from Exponential (1). 
2. Generate U from Uniform(0,1). 
3. If ln(U) < (a−1)(1−Y+ln(Y)), set X=aY; otherwise go to step 1.
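The Fishman rejection scheme for a>1 and b=1 can be sketched as below. The function name `gamma_fishman` and the `rng` argument are assumptions made for this example; the exponential proposal is drawn with the standard library's `expovariate`, whose argument is a rate.

```python
import math
import random


def gamma_fishman(a, rng=random):
    """Draw one gamma(a, 1) variate by exponential-proposal rejection (a > 1)."""
    if a <= 1.0:
        raise ValueError("this sketch requires a > 1")
    while True:
        # Step 1: exponential(1) proposal.
        y = rng.expovariate(1.0)
        if y <= 0.0:
            continue  # guard against log(0); a zero draw is vanishingly rare
        # Step 2: uniform draw mapped into (0, 1] so that log(u) is finite.
        u = 1.0 - rng.random()
        # Step 3: accept X = a*Y when ln(U) < (a-1)(1 - Y + ln(Y)).
        if math.log(u) < (a - 1.0) * (1.0 - y + math.log(y)):
            return a * y
```

With b=1 the mean of the gamma(a,1) distribution is a, so the sample mean of many draws should be close to a.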


References. CDF: AS 32 (1970) and AS 239 (1988); ICDF: Use the relationship between gamma 
and chi-square distributions. RV: (Ahrens et al., 1974), (Fishman, 1976), and (Tadikamalla, 1978). 
(See the Algorithm Index and References.) 


Half-normal 


The half-normal distribution takes values in the range x>=μ and has one location parameter, μ,
and one scale parameter, σ. Parameter σ must be positive.


Functions. The CDF, IDF, PDF, and RV functions are available. 


The half-normal distribution has PDF, CDF, and IDF 


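$$f(x;\mu,\sigma) = \frac{2}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)}$$

$$F(x;\mu,\sigma) = 2\Phi\left(\frac{x-\mu}{\sigma}\right) - 1$$

$$F^{-1}(p;\mu,\sigma) = \mu + \sigma\Phi^{-1}\left(\frac{1+p}{2}\right)$$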


Relationship to other distributions. 
■ If X has a normal(μ,σ) distribution, then |X−μ| has a half-normal(0,σ) distribution.


Random Number Generation 
1. Generate X from a normal(a,b) distribution. 


2. Then |X—a| has a half normal distribution. 


Inverse Gaussian 


The inverse Gaussian, or Wald, distribution takes values in the range x>0 and has two parameters,
μ and λ, both of which must be positive. The distribution has mean μ.


Common uses. The inverse Gaussian distribution is commonly used to test hypotheses for model 
parameter estimates. 


Functions. The CDF, IDF, PDF, and RV functions are available. 


The inverse Gaussian distribution has PDF and CDF 


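$$f(x;\mu,\lambda) = \left(\frac{\lambda}{2\pi x^3}\right)^{1/2} \exp\left(-\frac{\lambda(x-\mu)^2}{2\mu^2 x}\right)$$

$$F(x;\mu,\lambda) = \Phi\left(\sqrt{\frac{\lambda}{x}}\left(\frac{x}{\mu} - 1\right)\right) + e^{2\lambda/\mu}\,\Phi\left(-\sqrt{\frac{\lambda}{x}}\left(\frac{x}{\mu} + 1\right)\right)$$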


The IDF is computed by approximation. 
Inverse CDF Approximation 
For the upper tail, an inverse Gaussian variable X can be approximated by

$$X = \alpha\chi^2_\nu + \beta$$

where

$$\alpha = 3a^2/(4b)$$
$$\beta = a/3$$
$$\nu = 8b/(9a)$$

For the lower tail, one can use the approximation

$$X = (\alpha\chi^2_\nu + \beta)^{-1}$$

where

$$\alpha = (3b + 8a)/[4b(b + 2a)]$$
$$\beta = (b + 3a)/[a(3b + 8a)]$$
$$\nu = 8(b + 2a)^3/[a(3b + 8a)^2]$$


Random Number Generation 
1. Generate a standard normal variate Z. 
2. Let w = aZ².
3. Let x = a + (a/(2b))(w − √(w(4b + w))).
4. Let p = a/(a + x).
5. Then the inverse Gaussian variate will take value x with probability p and value a²/x with
probability 1 − p.
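The transformation method of Michael, Schucany, and Haas (1976) can be sketched as follows. In that method the smaller root x is kept with probability a/(a+x) and the reciprocal root a²/x is used otherwise. The function name `inverse_gaussian_variate` and the `rng` argument are assumptions for this illustration; a is the mean parameter and b the second (λ) parameter.

```python
import math
import random


def inverse_gaussian_variate(a, b, rng=random):
    """Draw one inverse Gaussian(a, b) variate from a single normal draw."""
    if a <= 0.0 or b <= 0.0:
        raise ValueError("both parameters must be positive")
    # Steps 1-2: scaled squared standard normal.
    z = rng.gauss(0.0, 1.0)
    w = a * z * z
    # Step 3: smaller root of the associated quadratic equation.
    x = a + (a / (2.0 * b)) * (w - math.sqrt(w * (4.0 * b + w)))
    # Steps 4-5: keep x with probability a/(a+x), else use a*a/x.
    p = a / (a + x)
    if rng.random() < p:
        return x
    return a * a / x
```

The distribution has mean a, so the average of many simulated values should be close to a; swapping the two probabilities would shift the sample mean noticeably.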


References. (Mudholkar and Natarajan, 1999) and (Michael, Schucany, and Haas, 1976). (See the
Algorithm Index and References.) 
Laplace 


The Laplace distribution takes real values and has one location parameter, μ, and one scale
parameter, β. Parameter β must be positive. The distribution is symmetric about μ and has
exponentially decaying tails.


Functions. The CDF, IDF, PDF, and RV functions are available. 
The Laplace distribution has PDF, CDF, and IDF 


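$$f(x;\mu,\beta) = \frac{1}{2\beta}\, e^{-|x-\mu|/\beta}$$

$$F(x;\mu,\beta) = \begin{cases} \frac{1}{2} e^{(x-\mu)/\beta} & x < \mu \\ 1 - \frac{1}{2} e^{-(x-\mu)/\beta} & x \ge \mu \end{cases}$$

$$F^{-1}(p;\mu,\beta) = \begin{cases} \mu + \beta\ln(2p) & 0 < p < 1/2 \\ \mu - \beta\ln(2(1-p)) & 1/2 \le p < 1 \end{cases}$$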


Random Number Generation 


Inverse CDF algorithm 
1. Generate Y and U independently from Exponential(1/b) and Uniform(0,1), respectively.


2. If U>0.5, set X=a+Y; otherwise set X=a−Y.
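A minimal Python sketch of these two steps follows. The name `laplace_variate` and the `rng` argument are assumptions for this example. The exponential variate is drawn with rate 1/b, so it has mean b, matching the parameterization used in this chapter.

```python
import random


def laplace_variate(a, b, rng=random):
    """Draw one Laplace(a, b) variate as a randomly signed exponential offset."""
    if b <= 0.0:
        raise ValueError("scale b must be positive")
    # Step 1: exponential magnitude (rate 1/b, mean b) and a uniform for the sign.
    y = rng.expovariate(1.0 / b)
    u = rng.random()
    # Step 2: place the magnitude above or below the location a.
    return a + y if u > 0.5 else a - y
```

The simulated values should have mean close to a and mean absolute deviation from a close to b.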


Logistic 


The logistic distribution takes real values and has one location parameter, μ, and one scale
parameter, ς. Parameter ς must be positive. The distribution is symmetric about μ and has longer
tails than the normal distribution.


Common uses. The logistic distribution is used to model growth curves. 
Functions. The CDF, IDF, PDF, and RV functions are available. 

The logistic distribution has PDF, CDF, and IDF 

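$$f(x;\mu,\varsigma) = \frac{1}{\varsigma}\, e^{-(x-\mu)/\varsigma}\left(1 + e^{-(x-\mu)/\varsigma}\right)^{-2}$$

$$F(x;\mu,\varsigma) = \frac{1}{1 + e^{-(x-\mu)/\varsigma}}$$

$$F^{-1}(p;\mu,\varsigma) = \mu + \varsigma\ln\left(\frac{p}{1-p}\right)$$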
Random Number Generation 
Inverse CDF algorithm 
1. Generate U from Uniform(0,1).


2. Set X = a + b ln(U/(1−U)).


Lognormal 


The lognormal distribution takes values in the range x>=0 and has two parameters, η and σ, both
of which must be positive.


Common uses. Lognormal is used in the distribution of particle sizes in aggregates, flood flows, 
concentrations of air contaminants, and failure time. 


Functions. The CDF, IDF, PDF, and RV functions are available. 
The lognormal distribution has PDF, CDF, and IDF 


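$$f(x;\eta,\sigma) = \frac{1}{x\sigma\sqrt{2\pi}}\, e^{-(\ln(x/\eta))^2/(2\sigma^2)}$$

$$F(x;\eta,\sigma) = \Phi\left(\frac{\ln(x/\eta)}{\sigma}\right)$$

$$F^{-1}(p;\eta,\sigma) = \eta\, e^{\sigma\Phi^{-1}(p)}$$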


Relationship to other distributions. 


■ If X has a lognormal(η,σ) distribution, then ln(X) has a normal(ln(η),σ) distribution.
Random Number Generation 


Inverse CDF algorithm 


1. Generate Z from N(0,1).


2. Set X = a exp(bZ).


Noncentral Beta 


The noncentral beta distribution is a generalization of the beta distribution that takes values in the
range 0<x<1 and has an extra noncentrality parameter, λ, which must be greater than or equal to 0.


Functions. 


The noncentral beta distribution has PDF, CDF, and IDF 


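$$f(x;\alpha,\beta,\lambda) = \sum_{j=0}^{\infty} \frac{1}{j!}\left(\frac{\lambda}{2}\right)^j e^{-\lambda/2}\, \frac{x^{\alpha+j-1}(1-x)^{\beta-1}}{B(\alpha+j,\beta)}$$

$$F(x;\alpha,\beta,\lambda) = \sum_{j=0}^{\infty} \frac{1}{j!}\left(\frac{\lambda}{2}\right)^j e^{-\lambda/2}\, IB(x;\alpha+j,\beta)$$

where $B(a,b) = \int_0^1 x^{a-1}(1-x)^{b-1}\,dx$ is the beta function and
$IB(x;a,b) = \frac{1}{B(a,b)}\int_0^x t^{a-1}(1-t)^{b-1}\,dt$ is the incomplete beta function.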


Relationship to other distributions.
■ When λ equals 0, this distribution reduces to the beta distribution.


■ The noncentral beta(α,β,λ) distribution is the distribution of X/(X+Y) where X is a variable
that has a noncentral chi-square(2α,λ) distribution, and Y is a variable that has a central
chi-square(2β) distribution.


References. CDF: (Abramowitz and Stegun, 1970) Chapter 26, AS 226 (1987), and AS R84 
(1990). (See the Algorithm Index and References.) 

Noncentral Chi-Square 
The noncentral chi-square distribution is a generalization of the chi-square distribution that takes
values in the range x>=0 and has an extra noncentrality parameter, λ, which must be greater
than or equal to 0.


Functions. 


The noncentral chi-square distribution has PDF and CDF 


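$$f(x;\nu,\lambda) = \sum_{j=0}^{\infty} \frac{1}{j!}\left(\frac{\lambda}{2}\right)^j e^{-\lambda/2}\, \frac{x^{\nu/2+j-1} e^{-x/2}}{2^{\nu/2+j}\,\Gamma(\nu/2+j)}$$

$$F(x;\nu,\lambda) = \sum_{j=0}^{\infty} \frac{1}{j!}\left(\frac{\lambda}{2}\right)^j e^{-\lambda/2}\, IG\left(\frac{x}{2}; \frac{\nu}{2}+j\right)$$

where $\Gamma(a) = \int_0^\infty x^{a-1} e^{-x}\,dx$ is the gamma function and $IG(x;a) = \frac{1}{\Gamma(a)}\int_0^x t^{a-1} e^{-t}\,dt$ is the
incomplete gamma function.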


Relationship to other distributions. 
■ When λ equals 0, this distribution reduces to the chi-square distribution.


■ The noncentral chi-square(ν,λ) distribution is the distribution of the sum of squares of ν
independent normal(μ_i,1) random variates. Then λ = Σ μ_i².


References. CDF: (Abramowitz et al., 1970) Chapter 26, AS 170 (1981), AS 231 (1987). Density: 
AS 275 (1992). (See the Algorithm Index and References.) 


Noncentral F 


The noncentral F distribution is a generalization of the F distribution that takes values in the range
x>=0 and has an extra noncentrality parameter, λ, which must be greater than or equal to 0.


Functions. 


The noncentral F distribution has PDF and CDF 


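$$f(x;\nu_1,\nu_2,\lambda) = \sum_{j=0}^{\infty} \frac{1}{j!}\left(\frac{\lambda}{2}\right)^j e^{-\lambda/2}\, \frac{(\nu_1/\nu_2)^{\nu_1/2+j}}{B(\nu_1/2+j,\nu_2/2)}\, x^{\nu_1/2+j-1}\left(1 + \frac{\nu_1}{\nu_2}x\right)^{-((\nu_1+\nu_2)/2+j)}$$

$$F(x;\nu_1,\nu_2,\lambda) = \sum_{j=0}^{\infty} \frac{1}{j!}\left(\frac{\lambda}{2}\right)^j e^{-\lambda/2}\, IB\left(\frac{\nu_1 x}{\nu_1 x + \nu_2}; \frac{\nu_1}{2}+j, \frac{\nu_2}{2}\right)$$

where $B(a,b) = \int_0^1 x^{a-1}(1-x)^{b-1}\,dx$ is the beta function and
$IB(x;a,b) = \frac{1}{B(a,b)}\int_0^x t^{a-1}(1-t)^{b-1}\,dt$ is the incomplete beta function.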


Relationship to other distributions. 
■ When λ equals 0, this distribution reduces to the F distribution.


■ The noncentral F distribution is the distribution of (X/ν1)/(Y/ν2), where X and Y are
independent variates with noncentral chi-square(ν1, λ) and central chi-square(ν2) distributions,
respectively.


References. CDF: (Abramowitz et al., 1970) Chapter 26. (See the Algorithm Index and 
References.) 


Noncentral Student’s t


The noncentral t distribution is a generalization of the t distribution that takes real values and has
an extra noncentrality parameter, λ, which must be greater than or equal to 0. When λ equals 0,
this distribution reduces to the t distribution.


Functions.


The noncentral t distribution has PDF and CDF

where $B(a,b) = \int_0^1 x^{a-1}(1-x)^{b-1}\,dx$ is the beta function and
$IB(x;a,b) = \frac{1}{B(a,b)}\int_0^x t^{a-1}(1-t)^{b-1}\,dt$ is the incomplete beta function.


Relationship to other distributions.


■ The noncentral t(ν,λ) distribution is the distribution of X/Y, where X is a normal(λ,1) variate
and Y is the square root of an independent central chi-square(ν) variate divided by ν.


Special case

$$F(0) = 1 - \Phi(\lambda)$$


References. CDF: (Abramowitz et al., 1970) Chapter 26, AS 5 (1968), AS 76 (1974), and AS 243
(1989). (See the Algorithm Index and References.)


Normal


The normal, or Gaussian, distribution takes real values and has one location parameter, μ, and
one scale parameter, σ. Parameter σ must be positive. The distribution has mean μ and standard
deviation σ.


Functions. The CDF, IDF, PDF, and RV functions are available. 
The normal distribution has PDF, CDF, and IDF 
$$f(x;\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)}$$

$$F(x;\mu,\sigma) = \Phi\left(\frac{x-\mu}{\sigma}\right)$$

$$F^{-1}(p;\mu,\sigma) = \mu + \sigma\Phi^{-1}(p)$$


Relationship to other distributions. 
■ If X has a normal(μ,σ) distribution, then exp(X) has a lognormal(exp(μ),σ) distribution.
For Φ and Φ⁻¹, see “Standard Normal”.


Random Number Generation 


Kinderman and Ramage (1976) method 


1. Generate X as X=a+bz, where z is an N(0,1) random number.
References. CDF: AS 66 (1973); ICDF: AS 111 (1977) and AS 241 (1988); RV: CACM 488
(1974) and (Kinderman and Ramage, 1976). (See the Algorithm Index and References.)
Pareto 


The Pareto distribution takes values in the range xmin<x and has a threshold parameter, xmin,
and a shape parameter, α. Both parameters must be positive.


Common uses. Pareto is commonly used in economics as a model for a density function with a 
slowly decaying tail. 


Functions. The CDF, IDF, PDF, and RV functions are available. 


The Pareto distribution has PDF, CDF, and IDF 


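$$f(x;x_{\min},\alpha) = \frac{\alpha}{x_{\min}}\left(\frac{x_{\min}}{x}\right)^{\alpha+1}$$

$$F(x;x_{\min},\alpha) = 1 - \left(\frac{x_{\min}}{x}\right)^{\alpha}$$

$$F^{-1}(p;x_{\min},\alpha) = x_{\min}(1-p)^{-1/\alpha}$$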


Random Number Generation 


Inverse CDF 


1. Generate U from Uniform(0,1). 


2. Set X = a(1−U)^(−1/b).


Studentized Maximum Modulus 


The Studentized maximum modulus distribution takes values in the range x>0 and has a number 
of comparisons parameter, k*, and degrees of freedom parameter, v, both of which must be 
greater than or equal to 1. 


Common uses. The Studentized maximum modulus is commonly used in post hoc multiple 
comparisons for GLM and ANOVA. 




Functions. The CDF and IDF functions are available, and are computed by approximation. 


The Studentized maximum modulus distribution has CDF 


$$F(x) = \int_0^\infty \left[2\Phi(xu) - 1\right]^{k^*} dg_\nu(u)$$

where φ(·) and Φ(·) are the PDF and CDF of the standard normal distribution and

$$dg_\nu(u) = \frac{\nu^{\nu/2}}{\Gamma(\nu/2)\,2^{\nu/2-1}}\, u^{\nu-1} \exp(-\nu u^2/2)\, du$$


The IDF does not have a closed form. The CDF can be computed by using numerical integration. 
The inverse CDF can be found by solving F(x) = p numerically for given p. 
Studentized Range 


The Studentized range distribution takes values in the range x>0 and has a number of samples 
parameter, k, and degrees of freedom parameter, v, both of which must be greater than or equal to 1. 


Common uses. The Studentized range is commonly used in post hoc multiple comparisons 
for GLM and ANOVA. 


Functions. The CDF and IDF functions are available, and are computed by approximation. 


The Studentized range distribution has CDF 


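$$F(x) = \int_0^\infty \int_{-\infty}^{\infty} k\,\varphi(t)\left[\Phi(t) - \Phi(t - xu)\right]^{k-1} dt\, dg_\nu(u)$$

where φ(·) and Φ(·) are the PDF and CDF of the standard normal distribution and

$$dg_\nu(u) = \frac{\nu^{\nu/2}}{\Gamma(\nu/2)\,2^{\nu/2-1}}\, u^{\nu-1} \exp(-\nu u^2/2)\, du$$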


The IDF does not have a closed form. Both the CDF and IDF have to be computed numerically 
(see the following references). 


References. AS 190, plus correction and remark. (See the Algorithm Index and References.) 


Student’s t 


The Student t distribution takes real values and has one degrees of freedom parameter, ν, which
must be positive. The Student t distribution is symmetric about 0.


Common uses. The major uses of the Student t distribution are to test hypotheses and construct 
confidence intervals for means of data. 


Functions. The CDF, IDF, PDF, RV, NCDF, and NPDF functions are available. 


The t distribution has PDF, CDF, and IDF

$$f(x;\nu) = \frac{1}{\sqrt{\nu}\,B(1/2,\nu/2)}\left(1 + \frac{x^2}{\nu}\right)^{-(\nu+1)/2}$$

where $B(a,b) = \int_0^1 x^{a-1}(1-x)^{b-1}\,dx$ is the beta function and
$IB(x;a,b) = \frac{1}{B(a,b)}\int_0^x t^{a-1}(1-t)^{b-1}\,dt$ is the incomplete beta function.


Relationship to other distributions. 


■ The t(ν) distribution is the distribution of X/Y, where X is a normal(0,1) variate and Y is the
square root of an independent chi-square(ν) variate divided by ν.


■ The square of a t(ν) variate has an F(1,ν) distribution.


■ The t(ν) distribution approaches the normal(0,1) distribution as ν approaches infinity.
Random Number Generation 
Special case 
If a=1, generate X from a Cauchy (0, 1) distribution. 


Using the normal and the chi-square distributions 
1. Generate Z from N(0,1) and V from Chi-square(a) independently. 
2. Set X = Z/√(V/a).
References. CDF: AS 3 (1968), AS 27 (1970), and CACM 395 (1970); ICDF: CACM 396
(1970). (See the Algorithm Index and References.)


Uniform 


The uniform distribution takes values in the range a<x<b and has a minimum value parameter, a, 
and a maximum value parameter, b. 


Functions. The CDF, IDF, PDF, and RV functions are available. 


The uniform distribution has PDF, CDF, and IDF 


$$f(x;a,b) = \frac{1}{b-a}$$

$$F(x;a,b) = \frac{x-a}{b-a}$$

$$F^{-1}(p;a,b) = a + (b-a)p$$


Random Number Generation 


Inverse CDF algorithm 
1. Generate U from Uniform(0,1). 


2. Set X = a + (b−a)U.


References. Uniform of (0,1) is generated by the method in (Schrage, 1979). 


Weibull 


The Weibull distribution takes values in the range x>=0 and has one scale parameter, β, and one
shape parameter, α, both of which must be positive.


Common uses. The Weibull distribution is commonly used in survival analysis. 
Functions. The CDF, IDF, PDF, and RV functions are available. 

The Weibull distribution has PDF, CDF, and IDF 

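$$f(x;\beta,\alpha) = \frac{\alpha}{\beta}\left(\frac{x}{\beta}\right)^{\alpha-1} e^{-(x/\beta)^{\alpha}}$$

$$F(x;\beta,\alpha) = 1 - e^{-(x/\beta)^{\alpha}}$$

$$F^{-1}(p;\beta,\alpha) = \beta(-\ln(1-p))^{1/\alpha}$$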


Relationship to other distributions. 


■ A Weibull(β,1) distribution is equivalent to an exponential(1/β) distribution.
Random Number Generation 


Inverse CDF algorithm 


1. Generate U from Uniform(0,1). 


2. Set X = a(−ln(1−U))^(1/b).


Discrete Distributions 


These are functions of a single variable that takes integer values. 




Bernoulli 


The Bernoulli distribution takes values 0 or 1 and has one success probability parameter, θ, which
must be between 0 and 1, inclusive.


Functions. The CDF, PDF, and RV functions are available. 


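The Bernoulli distribution has PDF and CDF

$$f(x;\theta) = \begin{cases} 1-\theta & x = 0 \\ \theta & x = 1 \end{cases}$$

$$F(x;\theta) = \begin{cases} 1-\theta & x = 0 \\ 1 & x = 1 \end{cases}$$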


Relationship to other distributions. 


m The Bernoulli distribution is a special case of the binomial distribution and is used in simple 
success-failure experiments. 


Random Number Generation 


Special case 


If a=0, X=0. If a=1, X=1. 


Direct algorithm for 0<a<1 
1. Generate U from Uniform(0,1). 


2. Set X=1 if U<a (a success) and X=0 if U≥a (a failure).


Binomial 


The binomial distribution takes integer values 0<=x<=n, representing the number of successes
in n trials, and has one number of trials parameter, n, and one success probability parameter, θ.
Parameter n must be a positive integer and parameter θ must be between 0 and 1, inclusive.


Common uses. The binomial distribution is used in independently replicated success-failure 
experiments. 


Functions. The CDF, PDF, and RV functions are available. 


The binomial distribution has PDF and CDF 


$$f(x;n,\theta) = \binom{n}{x}\theta^x(1-\theta)^{n-x}$$

$$F(x;n,\theta) = \begin{cases} 1 - IB(\theta; x+1, n-x) & x = 0, 1, \ldots, n-1 \\ 1 & x = n \end{cases}$$

where $IB(x;a,b) = \frac{1}{B(a,b)}\int_0^x t^{a-1}(1-t)^{b-1}\,dt$ is the incomplete beta function.


Random Number Generation 
Special case 
If b=0, X=0. If b=1, X=a.


Algorithm BB due to Ahrens and Dieter (1974) for 0<b<1


1. Set c=a, d=b, k=0, y=0, and h=1.
2. If c<40, generate J from Binomial(c, d) using algorithm BU, set X=k+J, and finish.
3. If c is odd, go to step 4. If c is even, set c=c−1 and generate U from Uniform(0,1). If U<d,
set k=k+1.
4. Set s=(c+1)/2 and generate S from Beta(s, s). Set G=hS and Z=y+G.
5. If Z<b, set y=Z, h=h−G, d=(b−Z)/h, and k=k+s; otherwise set h=G and d=(b−y)/h.
6. Set c=s−1 and go to step 2.


Computation time for algorithm BB is O(log a). 


References. RV: (Ahrens et al., 1974). 


Geometric 


The geometric distribution takes integer values x>=1, representing the number of trials needed
(including the last trial) before a success is observed, and has one success probability parameter,
θ, which must be between 0 and 1, inclusive.


Functions. The CDF, PDF, and RV functions are available. 
The geometric distribution has PDF and CDF 

$$f(x;\theta) = \theta(1-\theta)^{x-1}$$

$$F(x;\theta) = 1 - (1-\theta)^{x}$$


Relationship to other distributions. 


■ The geometric(θ) distribution is equivalent to the negative binomial(1,θ) distribution.
Random Number Generation 
Special case 
If a=1, X=1. 
Direct algorithm for 0 <a<1 
1. Set X=1. 


2. Generate U from Uniform(0,1). 




3. If U>a, set X=X+1 and go to step 2; otherwise finish. 


Hypergeometric 


The hypergeometric distribution takes integer values in the range max(0, 
Np+n-N)<=x<=min(Np,n), and has three parameters, N, n, and Np, where N is the total number 
of objects in an urn model, n is the number of objects randomly drawn without replacement from 
the urn, Np is the number of objects with a given characteristic, and x is the number of objects 
with the given characteristic observed out of the withdrawn objects. All three parameters are 
positive integers, and both n and Np must be less than or equal to N. 


Functions. The CDF, PDF, and RV functions are available. 


The hypergeometric distribution has PDF and CDF

$$f(x;N,n,N_p) = \frac{\binom{N_p}{x}\binom{N-N_p}{n-x}}{\binom{N}{n}}$$

$$F(x;N,n,N_p) = \sum_{k=\max(0,\,n+N_p-N)}^{x} \text{Prob}(X=k)$$


Random Number Generation 


Special case 


If b=a, X=c. If c=a, X=b. 


Direct algorithm 
1. (Initialization) X=0, g=c, h=b, t=a.
2. Do the following loop exactly b times:
Begin Loop
i. Generate U from Uniform(0,1).
ii. If U < (g/t), set X=X+1 and g=g−1; else set h=h−1.
iii. If g=0, go to step 3.
iv. If h=0, set X = X + b − i, where i (from 1 to b) is the loop index. Go to step 3.
v. Set t=t−1.
End Loop
3. Finish.
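The direct algorithm can be sketched in Python as below. This follows the steps exactly as listed, including the initialization h=b. The function name `hypergeometric_variate` and the `rng` argument are assumptions for this illustration; a is the total number of objects, b the number drawn, and c the number with the given characteristic.

```python
import random


def hypergeometric_variate(a, b, c, rng=random):
    """Simulate b draws without replacement from a objects, c of them marked."""
    if not (0 < b <= a and 0 <= c <= a):
        raise ValueError("need 0 < b <= a and 0 <= c <= a")
    # Step 1: X counts marked objects drawn, g marked objects left,
    # t objects left in the urn, h initialised to b as in the listed steps.
    x, g, h, t = 0, c, b, a
    # Step 2: loop exactly b times unless a shortcut exit applies.
    for i in range(1, b + 1):
        u = rng.random()
        if u < g / t:
            x += 1
            g -= 1
        else:
            h -= 1
        if g == 0:
            break  # no marked objects remain
        if h == 0:
            x += b - i
            break
        t -= 1
    # Step 3: finish.
    return x
```

The result always lies between max(0, b+c−a) and min(b, c), and its long-run average is bc/a.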


References. CDF: AS 152 (1989), AS R77 (1989), and AS R86 (1991). (See the “Algorithm 
Index” and “References”) 


Negative Binomial


The negative binomial distribution takes integer values in the range x>=r, where x is the number
of trials needed (including the last trial) before r successes are observed, and has one threshold
parameter, r, and one success probability parameter, θ. Parameter r must be a positive integer and
parameter θ must be greater than 0 and less than or equal to 1.


Functions. The CDF, PDF, and RV functions are available.


The negative binomial distribution has PDF and CDF

$$f(x;r,\theta) = \binom{x-1}{r-1}\theta^{r}(1-\theta)^{x-r}$$

$$F(x;r,\theta) = IB(\theta; r, x-r+1)$$

where $IB(x;a,b) = \frac{1}{B(a,b)}\int_0^x t^{a-1}(1-t)^{b-1}\,dt$ is the incomplete beta function.


Relationship to other distributions.


■ The negative binomial(1,θ) distribution is equivalent to the geometric(θ) distribution.
Random Number Generation

Special case

If b=1, X=a.

Direct algorithm

1. Generate G from Gamma(a, b/(1−b)).
2. If G=0, go to step 1. Otherwise generate P from Poisson(G).
3. Compute X=P+a.


Poisson


The Poisson distribution takes integer values in the range x>=0 and has one rate or mean
parameter, λ. Parameter λ must be positive.


Common uses. The Poisson distribution is used in modeling the distribution of counts, such as 
traffic counts and insect counts. 


Functions. The CDF, PDF, and RV functions are available.

The Poisson distribution has PDF and CDF

$$f(x;\lambda) = \frac{e^{-\lambda}\lambda^{x}}{x!}$$

$$F(x;\lambda) = 1 - IG(\lambda; x+1)$$

where $IG(x;a) = \frac{1}{\Gamma(a)}\int_0^x t^{a-1} e^{-t}\,dt$ is the incomplete gamma function.


Random Number Generation 
Algorithm PG due to Ahrens and Dieter (1974) 
1. (Initialization) Set X=0 and w=a. 
2. If w>16, go to step 6.
3. Set c=exp(−w) and p=1.
4. Generate U from Uniform(0,1). Set p=pU.
5. If p<c, finish; otherwise set X=X+1 and go to step 4.
6. Set n = [7w/8]. Generate G from Gamma(n, 1).
7. If G>w, generate Y from Binomial(n−1, w/G), set X=X+Y, and finish.
8. If G≤w, set X = X + n, w = w − G, and go to step 2.
Notes. [y] means the integer part of y.


Steps 3 to 5 of Algorithm PG are in fact the direct algorithm. 
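A compact Python sketch of Algorithm PG is shown below, under the reading that the direct method (steps 3 to 5) and the binomial branch (step 7) both end the algorithm. The name `poisson_pg` and the `rng` argument are assumptions for this example, and the binomial variate in step 7 is produced by summing Bernoulli trials rather than by a dedicated binomial algorithm.

```python
import math
import random


def poisson_pg(a, rng=random):
    """Draw one Poisson(a) variate by gamma splitting plus the direct method."""
    if a <= 0.0:
        raise ValueError("mean a must be positive")
    x = 0
    w = float(a)
    # Steps 2 and 6-8: reduce large w using the waiting time of the n-th event.
    while w > 16.0:
        n = int(7.0 * w / 8.0)
        g = rng.gammavariate(n, 1.0)
        if g > w:
            # Fewer than n events fall in [0, w]; the n-1 earlier events are
            # uniform on [0, g], so their count in [0, w] is binomial.
            prob = w / g
            return x + sum(1 for _ in range(n - 1) if rng.random() < prob)
        x += n
        w -= g
    # Steps 3-5: direct (multiplication) method for the remaining mean w.
    c = math.exp(-w)
    p = 1.0
    while True:
        p *= rng.random()
        if p < c:
            return x
        x += 1
```

Both the mean and the variance of the simulated counts should be close to a, which gives a simple check on either branch.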


References. RV: (Ahrens et al., 1974). 


Special Functions 


These are not distribution functions, but are used in the functional definition of one or more 
distributions. 


Gamma Function 


$$\Gamma(a) = \int_0^\infty x^{a-1} e^{-x}\,dx \qquad a > 0$$


The gamma function has the following properties:


■ Γ(1) = 1

■ Γ(1/2) = √π

■ Γ(a) = (a−1)Γ(a−1) for a>1

■ Γ(a) = (a−1)! when a is a positive integer


Note. Since Γ(a) can be very large even for a moderate value of a, the (natural) logarithm of
Γ(a) is computed instead.


References. The ln(Γ(a)) function: CACM 291 (1966) and AS 245 (1989). (See the “Algorithm
Index” and “References”) 


Beta Function

$$B(a,b) = \int_0^1 x^{a-1}(1-x)^{b-1}\,dx \qquad a > 0,\ b > 0$$


The beta function has the following properties:


$$B(a,b) = \Gamma(a)\Gamma(b)/\Gamma(a+b)$$
$$B(b,a) = B(a,b)$$
$$B(a,b) = (b-1)B(a+1,b-1)/a$$
$$B(a,b) = (a+b)B(a+1,b)/a$$


Note. Usually, B(x, y) is calculated as:


$$B(x,y) = \exp\left(\ln(\Gamma(x)) + \ln(\Gamma(y)) - \ln(\Gamma(x+y))\right)$$
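The log-gamma form of this calculation can be illustrated with a few lines of Python. This is only a sketch: the function name `beta_function` is hypothetical, and it relies on `math.lgamma` from the standard library for ln(Γ(·)) rather than on the algorithms cited in this chapter.

```python
import math


def beta_function(x, y):
    """Compute B(x, y) through logarithms of the gamma function (x, y > 0)."""
    if x <= 0.0 or y <= 0.0:
        raise ValueError("both arguments must be positive")
    # Working on the log scale avoids overflow when gamma(x) or gamma(y) is huge.
    log_beta = math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)
    return math.exp(log_beta)
```

For instance, B(2,3) = Γ(2)Γ(3)/Γ(5) = 1/12, and B(1/2,1/2) = π. Arguments such as x = y = 200 still give a finite result, although Γ(200) itself would overflow double precision.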


Incomplete Gamma Function (Ratio) 


$$IG(x;a) = \frac{1}{\Gamma(a)}\int_0^x t^{a-1} e^{-t}\,dt \qquad x > 0$$

$$x_p = IG^{-1}(p;a) \iff p = IG(x_p;a) \qquad 0 < p < 1$$

for a>0


The incomplete gamma function has the following properties:
■ $IG(x;1) = 1 - e^{-x}$


■ Using integration by parts, for a>1

$$IG(x;a) = \frac{x^{a} e^{-x}}{\Gamma(a+1)} + IG(x;a+1)$$


Note. $IG^{-1}(1;a) = \infty$.


References. AS 32 (1970), AS 147 (1980), and AS 239 (1988). (See the “Algorithm Index” 
and “References”) 


Incomplete Beta Function (Ratio) 


$$IB(x;a,b) = \frac{1}{B(a,b)}\int_0^x t^{a-1}(1-t)^{b-1}\,dt \qquad 0 \le x \le 1$$

$$x_p = IB^{-1}(p;a,b) \iff p = IB(x_p;a,b) \qquad 0 < p < 1$$

for a>0 and b>0
The incomplete beta function has the following properties:
■ $IB(x;a,1) = x^{a}$


■ Using the transformation $t = \sin^2\theta$, we get $IB(x;\tfrac{1}{2},\tfrac{1}{2}) = \frac{2}{\pi}\sin^{-1}\sqrt{x}$


■ $IB(x;a,b) = 1 - IB(1-x;b,a)$

■ Using integration by parts, we get, for b>1,

$$IB(x;a,b) = \frac{\Gamma(a+b)}{\Gamma(a+1)\Gamma(b)}\, x^{a}(1-x)^{b-1} + IB(x;a+1,b-1)$$


■ Using the fact that $\frac{d}{dx}\left[x^{a}(1-x)^{b}\right] = a x^{a-1}(1-x)^{b-1} - (a+b)x^{a}(1-x)^{b-1}$, we have

$$IB(x;a,b) = \frac{\Gamma(a+b)}{\Gamma(a+1)\Gamma(b)}\, x^{a}(1-x)^{b} + IB(x;a+1,b)$$


References. AS 63 (1973); Inverse: AS 64 (1973), AS 109 (1977). (See the “Algorithm Index” 
and “References”) 


Standard Normal 


$$\Phi(x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-u^2/2}\,du \qquad -\infty < x < \infty$$

$$x_p = \Phi^{-1}(p) \iff \Phi(x_p) = p$$

For Φ⁻¹, the Abramowitz and Stegun method is used.


References. AS 66 (1973); Inverse: AS 111 (1977) and AS 241 (1988). See (Patel and Read, 
1982) for related distributions, and see the “Algorithm Index” and “References”. 


Algorithm Index 
AS 3: (Cooper, 1968)a
AS 5: (Cooper, 1968)b
AS 27: (Taylor, 1970)
AS 32: (Bhattacharjee, 1970)
AS 63: (Majumder and Bhattacharjee, 1973)a
AS 64: (Majumder and Bhattacharjee, 1973)b
AS 66: (Hill, 1973)
AS 76: (Young and Minder, 1974)
AS 91: (Best and Roberts, 1975)
AS 109: (Cran, Martin, and Thomas, 1977)
AS 111: (Beasley and Springer, 1977)
AS 134: (Atkinson and Whittaker, 1979)
AS 147: (Lau, 1980)
AS 152: (Lund, 1980)
AS 170: (Narula and Desu, 1981)
AS 190: (Lund and Lund, 1983), Correction (Lund and Lund, 1985), Remark (Royston, 1987)
AS 195: (Schervish, 1984)
AS 226: (Lenth, 1987)
AS 231: (Farebrother, 1987)
AS 239: (Shea, 1988)
AS 241: (Wichura, 1988)
AS 243: (Lenth, 1989)
AS 245: (Macleod, 1989)
AS 275: (Ding, 1992)
AS 462: (Donnelly, 1973)
AS R85: (Shea, 1991)
CACM 291: (Pike and Hill, 1966)
CACM 299: (Hill and Pike, 1967)
CACM 332: (Dorrer, 1968)
CACM 395: (Hill, 1970)a
CACM 396: (Hill, 1970)b
CACM 451: (Goldstein, 1973)
CACM 488: (Brent, 1974)


Box’s M Test


Appendix 


Box’s M statistic is used to test for homogeneity of covariance matrices. The jth set of r
dependent variables in the ith cell are $\mathbf{y}_{ij} = \mathbf{x}_{ij}\mathbf{B} + \mathbf{e}_{ij}$, where $\mathbf{e}_{ij} \sim N_r(\mathbf{0}, w_{ij}^{-1}\Sigma_i)$ for i=1,...,g
and j=1,...,n_i. The null hypothesis of the test for homogeneity of covariance matrices is
$H_0: \Sigma_1 = \cdots = \Sigma_g$. Box (1949) derived a test statistic based on the likelihood-ratio test. The
test statistic is called Box’s M statistic. For moderate to small sample sizes, an F approximation
is used to compute its significance.


Box’s M statistic is not designed to be used in a linear model context; therefore the observed
cell means are used in computing the statistic.


Note: Although Anderson (1958) mentioned that the population cell means can be
expressed as linear combinations of parameters, he assumed that the combination coefficients are
different for different cells, which is not the model assumed for GLM.


Notation 
The following notation is used throughout this chapter unless otherwise stated: 
Table J-1
Notation

Notation    Description
g           Number of cells with non-singular covariance matrices.
n_i         Number of cases in the ith cell.
n           Total sample size, n = n_1 + ... + n_g
y_ij        The jth set of dependent variables in the ith cell. A column vector of length r.
w_ij        Regression weight associated with y_ij. It is assumed w_ij > 0.
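Means

$$\bar{\mathbf{y}}_i = \sum_{j=1}^{n_i} w_{ij}\mathbf{y}_{ij} / n_i$$


Cell Covariance Matrix

$$\mathbf{S}_i = \begin{cases} \sum_{j=1}^{n_i} w_{ij}(\mathbf{y}_{ij} - \bar{\mathbf{y}}_i)(\mathbf{y}_{ij} - \bar{\mathbf{y}}_i)'/(n_i - 1) & \text{if } n_i > 1 \\ \mathbf{0} & \text{if } n_i \le 1 \end{cases}$$

Pooled Covariance Matrix

$$\mathbf{S} = \begin{cases} \sum_{i=1}^{g}(n_i - 1)\mathbf{S}_i/(n - g) & \text{if } n > g \\ \mathbf{0} & \text{if } n \le g \end{cases}$$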




Box’s M Statistic 


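$$M = \begin{cases} (n-g)\log|\mathbf{S}| - \sum_{i=1}^{g}(n_i - 1)\log|\mathbf{S}_i| & \text{if } |\mathbf{S}| > 0 \\ \text{SYSMIS} & \text{if } |\mathbf{S}| \le 0 \end{cases}$$

Significance


$$1 - \text{CDF.F}(\gamma M, f_1, f_2)$$


where CDF.F is the IBM® SPSS® Statistics function for the cumulative F distribution and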


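$$f_1 = (g-1)r(r+1)/2$$

$$\rho = 1 - \frac{2r^2 + 3r - 1}{6(r+1)(g-1)}\left(\sum_{i=1}^{g}\frac{1}{n_i - 1} - \frac{1}{n-g}\right)$$

$$\tau = \frac{(r-1)(r+2)}{6(g-1)}\left(\sum_{i=1}^{g}\frac{1}{(n_i - 1)^2} - \frac{1}{(n-g)^2}\right)$$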


The significance is a system-missing value whenever the denominator is zero in the above
expression.


Appendix


Confidence Intervals for Percentages and Counts Algorithms


This document describes the algorithms for computing confidence intervals for percentages
and counts for bar charts. The data are assumed to be from a simple random sample, and each
confidence interval is a separate or individual interval, based on a binomial proportion of the total
count. The computed binomial intervals are equal-tailed Jeffreys prior intervals (see Brown,
Cai, & DasGupta, 2001, 2002, 2003). Note that they are generally not symmetric around the
observed proportion. Therefore, the plotted interval bounds are generally not symmetric around
the observed percentage or count.


Notation 

The following notation is used throughout this section unless otherwise noted: 
Table K-1 
Notation 

Notation              Description 
$X_i$                 Distinct values of the category axis variable 
$w_i$                 Rounded sum of weights for cases with value $X_i$ 
$W = \sum_i w_i$      Total sum of weights over values of $X$ 
$p_i$                 Population proportion of cases at $X_i$ 
$\alpha$              Specified error level for $100(1-\alpha)$% confidence intervals 


IDF.BETA(p,shape1,shape2) in COMPUTE gives the pth quantile of the beta distribution or 
incomplete beta function with shape parameters shape1 and shape2. For a precise mathematical 
definition, see “Beta Function”. 


Confidence Intervals for Counts 
Lower bound for $W p_i = W\,[\text{IDF.BETA}(\alpha/2,\ w_i + .5,\ W - w_i + .5)]$. 


Upper bound for $W p_i = W\,[\text{IDF.BETA}(1 - \alpha/2,\ w_i + .5,\ W - w_i + .5)]$. 


Standard error for $W p_i = W \times \sqrt{(w_i/W)(1 - (w_i/W))/W}$ 


Confidence Intervals for Percentages 
Lower bound for $100\, p_i = 100\,[\text{IDF.BETA}(\alpha/2,\ w_i + .5,\ W - w_i + .5)]$. 


Upper bound for $100\, p_i = 100\,[\text{IDF.BETA}(1 - \alpha/2,\ w_i + .5,\ W - w_i + .5)]$. 


Standard error for $100\, p_i = 100 \times \sqrt{(w_i/W)(1 - (w_i/W))/W}$ 
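
The interval bounds and standard errors can be illustrated with the following Python sketch. It is not the IBM SPSS Statistics implementation: the function idf_beta is a hypothetical stand-in for IDF.BETA that inverts the beta cumulative distribution by bisection, with the distribution itself evaluated by a standard continued-fraction expansion. All function names and the example values are assumptions made for illustration.

```python
import math


def _beta_continued_fraction(a, b, x, max_iter=300, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even step of the recurrence.
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        h *= d * c
        # Odd step of the recurrence.
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def beta_cdf(x, a, b):
    """Cumulative distribution function of the beta(a, b) distribution."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log1p(-x))
    front = math.exp(log_front)
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_continued_fraction(a, b, x) / a
    return 1.0 - front * _beta_continued_fraction(b, a, 1.0 - x) / b


def idf_beta(p, shape1, shape2, tol=1e-12):
    """Hypothetical stand-in for IDF.BETA: the p-th beta quantile by bisection."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if beta_cdf(mid, shape1, shape2) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def jeffreys_bounds(w_i, W, alpha=0.05):
    """Equal-tailed Jeffreys bounds (lower, upper) for the proportion p_i."""
    a, b = w_i + 0.5, W - w_i + 0.5
    return idf_beta(alpha / 2, a, b), idf_beta(1 - alpha / 2, a, b)


def standard_error(w_i, W):
    """Standard error of the observed proportion w_i / W."""
    p_hat = w_i / W
    return math.sqrt(p_hat * (1 - p_hat) / W)


# Hypothetical example: 30 of 100 weighted cases fall in the category.
w_i, W = 30, 100
lower, upper = jeffreys_bounds(w_i, W)
count_bounds = (W * lower, W * upper)          # bounds for W * p_i
percent_bounds = (100 * lower, 100 * upper)    # bounds for 100 * p_i
count_se = W * standard_error(w_i, W)
percent_se = 100 * standard_error(w_i, W)
```

The same pair of proportion bounds is scaled by W for counts and by 100 for percentages, which is why the two sets of formulas differ only in the leading factor.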
Notices 


This information was developed for products and services offered worldwide. 


IBM may not offer the products, services, or features discussed in this document in other countries. 
Consult your local IBM representative for information on the products and services currently 
available in your area. Any reference to an IBM product, program, or service is not intended to 
state or imply that only that IBM product, program, or service may be used. Any functionally 
equivalent product, program, or service that does not infringe any IBM intellectual property right 
may be used instead. However, it is the user’s responsibility to evaluate and verify the operation 
of any non-IBM product, program, or service. 


IBM may have patents or pending patent applications covering subject matter described in this 
document. The furnishing of this document does not grant you any license to these patents. 
You can send license inquiries, in writing, to: 


IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785, 
U.S.A. 


For license inquiries regarding double-byte character set (DBCS) information, contact the IBM 
Intellectual Property Department in your country or send inquiries, in writing, to: 


Intellectual Property Licensing, Legal and Intellectual Property Law, IBM Japan Ltd., 1623-14, 
Shimotsuruma, Yamato-shi, Kanagawa 242-8502 Japan. 


The following paragraph does not apply to the United Kingdom or any other country 
where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS 
MACHINES PROVIDES THIS PUBLICATION “AS IS” WITHOUT WARRANTY OF ANY 
KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE 
IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS 
FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties 
in certain transactions, therefore, this statement may not apply to you. 


This information could include technical inaccuracies or typographical errors. Changes are 
periodically made to the information herein; these changes will be incorporated in new editions 
of the publication. IBM may make improvements and/or changes in the product(s) and/or the 
program(s) described in this publication at any time without notice. 


Any references in this information to non-IBM Web sites are provided for convenience only and 
do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites 
are not part of the materials for this IBM product and use of those Web sites is at your own risk. 


IBM may use or distribute any of the information you supply in any way it believes appropriate 
without incurring any obligation to you. 


Licensees of this program who wish to have information about it for the purpose of enabling: (i) the 
exchange of information between independently created programs and other programs (including 
this one) and (ii) the mutual use of the information which has been exchanged, should contact: 


IBM Software Group, Attention: Licensing, 233 S. Wacker Dr., Chicago, IL 60606, USA. 


Such information may be available, subject to appropriate terms and conditions, including in 
some cases, payment of a fee. 


The licensed program described in this document and all licensed material available for it are 
provided by IBM under terms of the IBM Customer Agreement, IBM International Program 
License Agreement or any equivalent agreement between us. 


Information concerning non-IBM products was obtained from the suppliers of those products, 
their published announcements or other publicly available sources. IBM has not tested those 
products and cannot confirm the accuracy of performance, compatibility or any other claims 
related to non-IBM products. Questions on the capabilities of non-IBM products should be 
addressed to the suppliers of those products. 


This information contains examples of data and reports used in daily business operations. 
To illustrate them as completely as possible, the examples include the names of individuals, 
companies, brands, and products. All of these names are fictitious and any similarity to the names 
and addresses used by an actual business enterprise is entirely coincidental. 


If you are viewing this information softcopy, the photographs and color illustrations may not 
appear. 
IBM, the IBM logo, ibm.com, and SPSS are trademarks of IBM Corporation, registered in 
many jurisdictions worldwide. A current list of IBM trademarks is available on the Web at 
http://www.ibm.com/legal/copytrade.shtml. 


Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered trademarks or 
trademarks of Adobe Systems Incorporated in the United States, and/or other countries. 
Corporation or its subsidiaries in the United States and other countries. 


Java and all Java-based trademarks and logos are trademarks of Sun Microsystems, Inc. in the 
United States, other countries, or both. 


Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. 


Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft 
Corporation in the United States, other countries, or both. 


UNIX is a registered trademark of The Open Group in the United States and other countries. 


This product uses WinWrap Basic, Copyright 1993-2007, Polar Engineering and Consulting, 
http://www.winwrap.com. 


Other product and service names might be trademarks of IBM or other companies. 